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Abstract 

Markov chain Monte Carlo (MCMC) lies at the core of modern Bayesian methodol¬ 
ogy, much of which would be impossible without it. Thus, the convergence properties 
of MCMCs have received significant attention, and in particular, proving (geometric) 
ergodicity is of critical interest. Trust in the ability of MCMCs to sample from modern- 
day high-dimensional posteriors, however, has been limited by a widespread perception 
that chains typically experience serious convergence problems in such regimes. Though 
there may be a good practical understanding of convergence problems (and the asso¬ 
ciated role of priors) in some settings, a clear theoretical characterization of these 
problems is not available. Current methods for obtaining convergence rates of such 
MCMCs typically proceed as if the dimension of the parameter, p, and sample size, n, 
are fixed. In this paper, we first demonstrate that contemporary methods have serious 
limitations when the dimension grows. We then propose a framework for rigorously 
establishing the convergence behavior of commonly used high-dimensional MCMCs. In 
particular, we demonstrate theoretically the precise nature and severity of the conver¬ 
gence problems of popular MCMCs when implemented in high dimensions, including 
phase transitions in the convergence rates in various n and p regimes. We then proceed 
to show a universality result for the convergence rate of MCMCs across an entire spec¬ 
trum of models. We also show that convergence problems in some important models 
effectively eliminate the apparent safeguard of geometric ergodicity. We then demon¬ 
strate theoretical principles by which MCMCs can be constructed and analyzed to yield 
bounded geometric convergence rates (essentially recovering geometric ergodicity) even 
as the dimension p grows without bound. Additionally, we propose a diagnostic tool 
for establishing convergence (or the lack thereof) for high-dimensional MCMCs. 



1 Introduction 


Markov chain Monte Carlo (MCMC) is an indispensable tool that has enabled much of mod¬ 
ern Bayesian inference, and advances in MCMC have revolutionized Bayesian methodology 
in recent decades (see Diaconis, 2009, for an overview). The rise of MCMC has been aided by 
the steady increase in computing capabilities, which has enabled many complex and sophisti¬ 
cated MCMC techniques. Thus, modern MCMC allows consideration of Bayesian posteriors 
for which no closed-form inferential solutions can be obtained. The applicability of MCMC 
to a wide range of problems has enabled an “honest exploration” of the Bayesian posterior 
(Jones and Hobert, 200’ ). 

An enormous amount of effort has been invested in establishing convergence properties 
of Markov chains. The basic question that most such work seeks to answer is the question 
of how long the chain must be run in order to approximate posterior quantities of interest 
to a desired precision. To this end, a primary goal is typically to show that chains arising 
in commonly used Bayesian methods are geometrically ergodic. A general approach for 
establishing geometric ergodicity was provided by Rosenthal (1995), and many subsequent 
results have been based on the method that Rosenthal laid out. Further details can be found 
in the work of Meyn and Tweedie (1993), Gilks et al. (1995), Jones and Hobert (2001), Flegal 
et al. (2008), and the references therein. 

Modern high-dimensional settings have created new challenges when considering the lim¬ 
iting properties of inferential procedures. Statistical theory has traditionally considered 
regimes in which the sample size n is large and the number of parameters p is small. How¬ 
ever, there is now much interest in so-called “small n, large p” or “large n, large p” settings, 
and considerable advances have been made toward asymptotics in various sample complexity 
regimes (see, e.g., Hero and Rajaratnam, 2015, 2016, for an overview). Bayesian inference 
enjoys certain advantages in such high-dimensional settings. Bayesian procedures often yield 
natural ways to undertake regularization and provide straightforward quantification of un¬ 
certainty. For both Bayesian and frequentist inference, substantial attention has been paid 
to two different types of complexity in high-dimensional regimes. The first type, compu¬ 
tational complexity, considers the computing time and resources that are required for the 
execution of an inferential algorithm. The second type, sample complexity, deals with the 
fundamental ability to recover an underlying signal in various n and p regimes. However, a 
third type of complexity is also of vital importance for modern MCMC schemes involving 
large numbers of parameters. This concept, which we call convergence complexity , is an issue 
that is unique to Bayesian inference. More precisely, convergence complexity considers the 
ability of an MCMC scheme to draw samples from the posterior, and how the ability to do 
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so changes as the dimension of the parameter set grows. Although MCMC is perhaps the 
most important tool of modern Bayesian inference, to our knowledge a systematic theoret¬ 
ical treatment of the convergence complexity of modern Markov chains in various n and p 
regimes is not available. 

The need for such an investigation also stems from the perceived scalability (or lack 
thereof) of Bayesian inferential methods to modern high-dimensional settings. It is well 
understood that approaches based on l\ or lasso regularization have enabled frequentist 
approaches to be scaled to high-dimensional settings. However, despite heroic efforts from 
the MCMC community, there is a still a widely held perception that scaling MCMCs to 
modern high-dimensional settings is simply not feasible. The end result is that the benefits 
of posterior inference are lost (especially the ability to readily quantify uncertainty). Having 
said this, there is however a general understanding among practitioners that scaling classical 
MCMCs to very high dimensions can be problematic and that prior specification can play 
a role in convergence issues. Thus, we believe that a general framework for undertaking a 
theoretical analysis of high-dimensional MCMCs in various n and p regimes is long overdue, 
since it is vital to understand the effectiveness of using MCMCs as a tool to draw from 
high-dimensional posteriors. 

In this paper, we undertake a detailed investigation of the convergence complexity of 
modern MCMCs that form the basis of more sophisticated models in many applications (see 
Gelman et ah, 2013; O’Hagan and Forster, 2010; and other standard Bayesian texts for con¬ 
crete examples). Specifically, we first study Markov chains associated with a Bayesian anal¬ 
ysis of the standard regression model and extensions thereof. These extensions include the 
Bayesian lasso, the Bayesian elastic net, and the spike-and-slab approach. We demonstrate 
that for Markov chains associated with standard regression-type models (and extensions 
thereof), the apparent theoretical safeguard of geometric ergodicity is merely an illusion if 
the dimension p grows faster than the sample size n. More precisely, although the chain is 
indeed geometrically ergodic for any fixed n and p, we show that the rate constant r = r np 
tends to 1 if p grows faster than n. Thus, the convergence of these Markov chains may 
still be quite slow in modern high-dimensional settings. Our results also carry over directly 
to graphical models. We then contrast this convergence complexity with that of chains of 
other popular models, including the class of hierarchical models and the multivariate mean 
model. We demonstrate that fortunately and contrary to perception, convergence behavior 
seen in high-dimensional regression models is not inherent to many commonly used high¬ 
dimensional Markov chains. Though it is not possible to analyze all models and various prior 
specifications, the spectrum of models we do consider gives general and compelling insights 
into convergence behavior. 
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In all the models we consider, we are able to obtain exact or sharp convergence rates for 
various Markov chains using novel technical approaches. The significance of doing so is better 
understood by first recognizing that establishing geometric ergodicity itself is considered a 
challenging task and is often undertaken on a case-by-case basis for various MCMCs. Thus 
we believe that the ability to obtain sharp results for the geometric convergence rate in 
terms of n and p constitutes a significant step forward in understanding the behavior of 
high-dimensional MCMCs. It also simultaneously delivers novel theoretical methods for 
deriving such convergence rates. 

The remainder of the paper is organized as follows. Section 2 contains a discussion of 
known results for Markov chains and considers these results in high-dimensional settings. 
Section 3 provides a rigorous consideration of high-dimensional convergence problems in the 
Bayesian regression framework. In Section 4, we investigate extensions of standard Bayesian 
regression, including the Bayesian lasso, Bayesian elastic net, and spike-and-slab regression. 
In Section 5 we consider the multivariate Gaussian mean model. In Section 6 we consider 
normal hierarchical models with known and unknown variances. Additionally, we propose a 
diagnostic tool for assessing convergence in various n and p regimes. Section 7 demonstrates 
how convergence rates that are uniformly bounded away from 1 may be obtained theoretically 
for high-dimensional Markov chains. Further discussion and conclusions are presented in 
Section 8. 


2 Preliminaries 

In this section, we present some preliminary results on the behavior of Markov chains. First, 
we review notions of Markov chain convergence and associated convergence rates, along 
with methods by which such properties can be rigorously established. We then consider 
Gibbs sampling and relevant properties of the joint and marginal chains that arise from such 
schemes. Next, we discuss the role of autocorrelation in Gibbs sampling and its relationship 
to a chain’s overall convergence behavior. Finally, we introduce the concept of convergence 
complexity , by which we mean the dependence of the chain’s geometric convergence rate r on 
the sample size n and the dimension of the parameter p. We introduce examples to illustrate 
this concept and to motivate the work in the remainder of the paper. 

2.1 Convergence Rates and Geometric Ergodicity 

The total variation distance between two probability measures P and Q defined on the 
same cr-algebra T is cItv(P, Q ) = sup^gj-1 P(A) — Q(A )|. In terms of Markov chains, if P£ 


3 


denotes the distribution of the kth iterate of a Markov chain with starting point x 0 and fl 
denotes the chain’s stationary distribution (i.e., the target posterior), then we are typically 
interested in d,Tv(P£ 0 , If). It is typically desirable for the distance dTv(-P® 0 ,hl) to converge 
to zero at a geometric rate, i.e., that there exist M xo > 0 and 0 < r < 1 such that 

(2.1) 

for every k > 1. When such constants exist (and provided certain other regularity conditions 
hold), the Markov chain is said to be geometrically ergodic. 

An active area of current research is the establishment of geometric ergodicity for Markov 
chains commonly used in applied Bayesian statistics. Rigorous proofs of such results can be 
challenging to obtain, and different models and sampling schemes must often be handled on 
a case-by-case basis (see in particular the rich array of results established by the work of 
J. Hobert and co-authors). Although a variety of methods may be used to prove geometric 
ergodicity (see, e.g., Meyn and Tweedic, 1993), these methods often establish the existence of 
a constant 0 < r < 1 satisfying the geometric bound in (2.1). More sophisticated techniques 
are typically needed to find quantitative bounds on the geometric convergence rate. The 
most widely employed approach for finding such bounds has been the method set forth by 
Rosenthal (1995). This method proceeds by establishing a drift condition and an associated 
minorization condition for the Markov chain in question. Let (Xf. : k > 0) be a Markov 
chain with state space X C and associated Borel a-algebra B. We assume the Markov 
chain satisfies certain regularity conditions, e.g., those of Jones and Hobert (200 ). Let P x 
denote its transition kernel, i.e., P X (A ) is the probability that X !+ i G A G B given that 
Xi = x. Let n denote the stationary distribution of the chain. The chain satisfies a drift 
condition if there exist a function V : X —y [0, oo) and constants 0 < A < 1 and b < oo such 
that 

J V dP x = E[V{X i+1 ) | Xi = x] < A V(x) + b for all xeX. (2.2) 

The chain satisfies a minorization condition if there exist a probability measure Q on B. a 
set C with n(C') > 0, and a constant e > 0 such that 

P X (A) > eQ(A) for all x G C and all A G B. (2.3) 

The establishment of geometric ergodicity requires that the set C be chosen specifically as 
C = {x G X : V{x) < d} for some d > 26/(1 — A). Jones and Hobert (: ) provide an 
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accessible conceptual discussion of the connections between these conditions and geometric 
ergodicity. 

The convergence rates of Markov chains can also be investigated using tools and tech¬ 
niques from functional analysis. (See Liu et ah, 1994; Liu, 1994; and the references therein 
for further details.) Let X denote the state space of a Markov chain (X k : k > 0) with 
stationary distribution II, and let Lq(II) denote the space of all functions h : X — > M. such 
that E[h(X)] = 0 and Var[/i(X)] < oo where X ~ LL For any function g e Lq(II), its 
norm ||g|| is defined as the square root of ||g|| 2 = Flj^X)] 2 } with X ~ II. Now define the 
forward operator F mapping Lq(L 1) to itself by 

Fg(x) = EMXJ | X 0 = ®]. 

The norm of the operator F is defined as ||-F|| = sup|| 9 |i_ 1 ||.Fg||, and its spectral radius 
is Tp = linifc^oo ||.F fc || 1//fe , noting that the fc-step forward operator F k is simply F k g{x ) = 
E\g(X k ) j X 0 = x\. If the Markov chain is reversible, then F is self-adjoint. It follows 
that the norm |_F||, spectral radius rp, and largest eigenvalue of F all share a common 
value r. Moreover, under certain regularity conditions, the chain is geometrically ergodic 
with geometric rate constant r if r < 1 ( 4u et ah, 1994, 1995; Liu, 200-!). 

2.2 Gibbs Sampling and Marginal Chains 

Many general techniques have been developed for constructing Markov chains to sample 
from a target posterior, such as the accept-reject algorithm and the Metropolis-Hastings 
algorithm ( Jetropolis et ah, 1953; Hastings, 1970). Many of these methods are based on 
proposing a new point and then either accepting or rejecting it with some probability. For 
such methods to obtain reasonably large acceptance probabilities in high-dimensional set¬ 
tings, they must propose points that are very close to the chain’s current state, which in turn 
limits their ability to quickly traverse the state space (see, e.g., the work on optimal scaling 
of Roberts and Rosenthal, 2001; Beskos and Stuart, 2009; and the references therein). 

However, one special case of the Metropolis-Hastings algorithm that is quite useful in 
high dimensions is known as the Gibbs sampler ( deman and Geman, 198-!). By construction, 
Gibbs samplers propose a new point in such a way that the acceptance probability is 1. 
Thus, they are very useful for tractably sampling from the posterior in high-dimensional 
settings. Moreover, a preponderance of theoretical convergence results establishing geometric 
ergodicity for specific MCMC schemes are for Gibbs samplers. Indeed, the machinery by 
which these theoretical results are established (such as the method of Rosenthal, 1991 ) 
is inherently better suited to Gibbs sampling than to other approaches (see, e.g., Choi and 
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Hobert, 2013; Khare and Hobert, 2013; Roman and Hobert, 2015; and the references therein). 
Thus, let {( X k ,Y k ) : k > 0} be a Markov chain constructed as a Gibbs sampler that 
alternates between drawing X and Y and has (joint) stationary distribution II. It is well 
known that the marginal sequences {Ac : k > 0} and {Y^ : k > 0} are reversible Markov 
chains (e.g., Liu et ah, 1994). Moreover, it can be shown that either all three chains are 
geometrically ergodic with the same rate or none of the chains are geometrically ergodic 
( .iu et al., 1994). Thus, to establish the geometric convergence rate of the joint chain, it 
suffices to find the largest eigenvalue of the forward operator of either marginal chain (or 
to fold the convergence rate of the marginal chain by some other method). This approach 
can simplify proofs of geometric convergence rates if one of the marginal chains is more 
analytically tractable than the joint chain. 

2.3 Autocorrelation Structure 

Even if a Markov chain is approximately sampling from its stationary distribution, the 
draws are not approximately independent (in general). The autocorrelation strutcture be¬ 
tween successive iterates can thus be of great importance to the MCMC practicitioner when 
considering questions such as the amount of error inherent to the MCMC samples, i.e., how 
much an MCMC approximation can be expected to differ from the corresponding “true” re¬ 
sult. From a more theoretical perspective, the autocorrelation structure of the chain is also 
of interest due to its connections to other properties of the chain, including its convergence 
properties. Indeed, it is intuitively clear that the greater the correlation between successive 
iterates, the more iterations it should take for the effects of the starting point (or starting 
distribution) to “wash out.” 

To properly state results on the autocorrelation structure of Markov chains, we first 
introduce a slightly more general notion of correlation. The maximal correlation between 
two random variables A and As (with some joint distribution) is defined as 

q(A, A) = sup Corral (A), 0 2 (A)], 

91,92 

where the supremum is taken over all functions g\ and 02 such that the variances Var[<?i(A)] 
and Var[ 02 (A)] are finite and nonzero. If (Y k : k > 0) is a stationary Markov chain with 
Y k ~ II, then the norm of its forward operator F can be shown to be equal to the maximal 
correlation between successive iterates, i.e., 

\\F\\ ='y(Y k ,Y k+1 ) 
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( fiu, 199 ). Now consider the specific case of a two-step Gibbs sampler to draw from 
some posterior distribution 7 r(0, 0 | Z), where 9 and 0 represent unknown parameters and 
Z represents observed data. This Gibbs sampler draws a sequence of iterates (9k,<t>k) by 
drawing alternately from the conditional posterior distributions 7r(0 j 0, Z) and 7 r(0 | 9 , Z). 
Suppose the chain is stationary, and let 7 ( 0 , 0 | Z) denote the maximal correlation between 
9 and 0 under the joint posterior. Then the forward operators of the joint and marginal 
Gibbs sampling chains all have spectral radius equal to the square of the maximum posterior 
correlation as given by [ 7 ( 0,0 | Z)] 2 ( el ah, 19! ). 

2.4 Convergence Complexity 

It is obviously extremely useful to show that any given Markov chain is geometrically ergodic. 
Still, a full characterization of the behavior of the chain cannot be reduced to simply the 
binary question of whether a chain does or does not have this property. Even if a chain is 
geometrically ergodic, the specific value of the geometric rate constant r in the bound in (2.1) 
can be of great practical importance, especially in ultra-high-dimensional applications. More 
specifically, if r is very close to 1, then a chain may still converge quite slowly despite the fact 
that it is geometrically ergodic, a fact that has been noted in the literature (Papaspiliopoulos 
et al., 2007; Papaspiliopoulos and Roberts, 2008; Woodard and Rosenthal, 2013). Of course, 
a value of r close to 1 would immediately raise the question of the sharpness of the associated 
inequality, i.e., whether the bound in (2.1) could be satisfied with some smaller choice of r. 
However, such questions regarding the convergence complexity of r may be difficult to answer 
when existing methods provide only upper bounds. 

More generally, in modern applications, various notions of complexity are often of interest. 
Practical limitations of computing time have motivated the consideration of computational 
complexity, and fundamental questions of signal recovery have led to the investigation of 
sample complexity (see Hero and Rajaratnam, 2015, 2016, and the references therein). For 
MCMC-based inferential procedures, the convergence complexity of the Markov chain in 
various n and p regimes is an important issue that warrants attention. In the context 
of Markov chain convergence, some authors have investigated the relationship between a 
chain’s convergence behavior and the sample size of the data on which the target posterior 
is conditioned (see Mossel and Vigoda, 2006; Papaspiliopoulos et ah, 2007; Woodard and 
Rosenthal, 2013; and the references therein). However, in modern high-dimensional statistics, 
there is also great interest in the behavior of the chain as the dimension of the unknown 
parameter vector grows without bound. If an MCMC scheme for a Bayesian method is based 
on a Markov chain that is geometrically ergodic, then a key question of practical significance 
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is how the associated rate constant r niP behaves in various n and p regimes. More specifically, 
a key question for any particular asymptotic regime is whether r n:P —y 1, or equivalently, 
whether the number of iterations required for approximate convergence (to within some fixed 
distance e of the stationary distribution) tends to infinity as n or p tends to infinity. An 
answer in the affirmative would suggest that the Markov chain could converge quite slowly 
in such a regime despite the apparent theoretical safeguard of geometric ergodicity. Note 
also that the constant M Xo in (2.1) can be disregarded in an asymptotic analysis as the 
geometric component r k drives the convergence rates. It is clear that r k is the leading term 
in the bound. 

It may seem that an approach to answering the question of convergence complexity may 
be provided by the method of Rosenthal (1995). Since this method is commonly used to 
obtain an upper bound for the geometric convergence rate r, it is natural to ask whether 
this bound can be directly analyzed in various n and p regimes. Somewhat problematically, 
such upper bounds may tend to 1 as n or p tends to infinity. We illustrate the behavior of 
these upper bounds in the following examples. 

Example 2.1. Consider a Bayesian analysis of the logistic regression model 

Yi j (3 ~ ind. Bin[l, logit -1 (xj0)~\ for each i 6 {1,..., n}, 

(3 ~ Np(O p , A -1 I P ), 

where aq,..., x n G and A > 0 are known, and where logit -1 (u) = e“/(l + e u ). A Gibbs 
sampler to draw from the posterior of a slightly more general version of this construction was 
developed by Poison et al. (2013). Choi and Hobert (2013) used the method of Rosenthal 
(199 ) to prove that this Gibbs sampler is geometrically ergodic (in fact, uniformly so) with 
a convergence rate bounded above by the quantity r = 1 — <5, where 

5 = A p/2 (detA) -1/2 2 -n exp 

where A = A X T X + A I p and Y = Y — |l n (see Proposition 3.1 of Hobert, 2013). 

Thus, the results of 'hoi and Hobert (2013) essentially establish the upper bound 

d T v(G fc , G) < Mf k = M (1 - 5) k (2.4) 

for some M > 0, where Gk denotes the distribution of the kth iterate of the joint chain and 
G denotes the corresponding stationary distribution. The following lemma establishes the 


n 

4 


1 

4A 


XA~ 1/2 X t Y 






behavior of r = 1 — 5, and hence the behavior of the upper bound in (2.4), as n or p grows. 
Its proof and all subsequent proofs are provided in the supplemental sections. 

Lemma 2.2. Consider the upper bound r = 1 — 5 provided by ' hoi and Hobert (2013) for 
the convergence rate of the logistic regression Gibbs sampler in Example 2.1. Then r —> 1 
exponentially fast as n —> oo (for fixed p) and as p —* oo (for fixed n). 

Thus, if either n or p tends to infinity, then the upper bound on the convergence rate 
tends to 1, and it does so exponentially fast. The apparent safeguard of geometric ergodicity 
is therefore misleading in high-dimensional applications when either n or p is very large. 

Example 2.3. Consider the Bayesian lasso framework of kirk and Casella ( !008): 

Y | (3,a 2 ,T~N n (X(3, a 2 I n ), 

/3 | a 2 ,r ~ N p (Op, cr 2 D T ), 

7r(cr 2 ) oc 1/a 2 , 

Tj ~ iid Exp(A/2) for each j 6 {1,... ,p}, 

where D T = Diag(ri,..., r p ). Park and Case! la ( ) provide a Gibbs sampler to draw from 

the posterior corresponding to the Bayesian lasso. Ivhare and Hobert (2013) demonstrated 
a useful result that this Gibbs sampler is geometrically ergodic with an upper bound r for 
its geometric rate constant. The following lemma establishes the asymptotic behavior of r. 

Lemma 2.4. Consider the upper bound r provided by Khare and Hobert (2013) for the con¬ 
vergence rate of the Bayesian lasso Gibbs sampler in Example 2.3. Then f —» 1 exponentially 
fast as n —>■ oo (for fixed p) and as p —>■ oo (for fixed n). 

Using the bounds from Examples 2.1 and 2.3, the number of iterations required to obtain 
convergence to a desired tolerance in total variation norm grows at least exponentially fast 
in p. Note that it is an upper bound and not a lower bound. To gain some insight into 
the possible disadvantages of these upper bounds in high-dimensional settings, consider the 
dimensions of the various quantities that appear in the proof of geometric ergodicity. More 
specifically, if the dimension of the distributions P and Q in the minorization condition 
in (2.3) is p, consider the constant e = e n ^ p that appears in the minorization condition 
in (2.3), noting that 1 — £ essentially corresponds to the upper bound for the geometric 
convergence rate. This constant will often take the form e n ^ p = (e*) p for some 0 < £* < 1. 
For example, if P is expressable as a product of p independent marginal distributions, then 
it will often be necessary to find a bound akin to the minorization condition in (2.3) for each 
such marginal distribution. Thus, the “overall” £ will be a product of p “individual” £-type 
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quantities. Hence, it is often the case that e n ^ v —» 0 as p —» oo. If indeed c n)P —>• 0, then the 
resulting bound on the geometric convergence rate (namely, 1 — e n ^ v or some power thereof) 
tends to 1 and hence is not useful. Such problems have hampered attempts to combine 
Rosenthal’s method with dimensional asymptotics (see, for instance, Hu and Rajaratnam, 
2012). Thus, alternative strategies may be required if we wish to obtain convergence rates 
that do not tend to 1 as the dimension grows. On the other hand, it may instead be 
asked whether there exist settings in which Rosenthal’s technique can overcome these high¬ 
dimensional obstacles. We show later in Section 7 that a specifically tailored application of 
Rosenthal’s approach may still lead to a convergence rate that is bounded away from 1. 


3 Regression Models & Graphical Models 

We now begin to pursue our goal of obtaining precise bounds for the geometric convergence 
rate of high-dimensional MCMCs. To this end, we first undertake a thorough investigation 
of the behavior of the Gibbs sampler for a Bayesian analysis of the standard regression 
model. The properties of this basic model are essential for illuminating the problems that 
certain types of Gibbs samplers encounter in high-dimensional regimes. These results also 
lead directly to corresponding results for Gibbs samplers in an important class of graphical 
models. 

Consider a Bayesian analysis of the standard regression model 

Y \(3,a 2 ~N n (X(3, a 2 I n ), 

(3 | cr 2 ~ N p ( Op, A - V 2 I P ), (3.1) 

7r((T 2 ) oc 1/a 2 , a 2 > 0, 


where X is a known n x p matrix of covariate values and A > 0 is a known regularization 
parameter. We assume n > 5 to facilitate the technical analysis. Then a Gibbs sampler 
to draw from the joint posterior under (3.1) may be constructed by taking an initial value 
Uq > 0 and then drawing (for every k > 1) 

(3k | ol- n Y ~ n p(P, oi-iA~^j, 

I (3k, Y ~ InverseGamma 


(3k ~ (3) A l(3k — (3j + C 


(3.2) 


where A = X 1 X + XI p (which is positive-definite), (3 = A 1 X 1 Y, and C = Y 1 (I n — 

XA~ 1 X t )Y. 
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3.1 Convergence Rates 


In order to understand the convergence behavior of the Gibbs sampler in (3.2) corresponding 
to a standard regression model, we proceed to undertake a fundamental analysis of this 
MCMC scheme. 

We now establish sharp bounds for the geometric convergence rate of the standard 
Bayesian regression Gibbs sampler in (3.2) in total variation norm in terms of the dimension p 
and sample size n. For every k > 0, let Fk{a(f) denote the joint distribution of ((3k, cr^) for 
the chain in (3.2) started with initial value a(, and let F denote the stationary distribution 
of this chain, i.e., the true joint posterior of ((3, a 2 ). Then we have the following result. 


Theorem 3.1. For the standard Bayesian regression Gibbs sampler in (3.2), there exist 
0 < Mi < M 2 such that 


Mi 


P 


n + p — 2 


< dxv [Fk(&l)i F] 


< M 2 


P 


n + p — 2 


k 


for every k > 0. 


Note that if p = p n grows faster than n, then the sharp bound provided by Theorem 3.1 
tends to 1. Hence, Theorem 3.1 provides our first theoretical indication of the precise nature 
of the convergence problem in high-dimensional Markov chains. I 11 Supplemental Section B, 
we also obtain similar rates in terms of Wasserstein distance dw , including expressions for the 
multiplicative constants in the bounds (i.e., the equivalent of Mi and M 2 in Theorem 3.1). 
These results allow us to derive expressions for the number of iterations required for conver¬ 
gence of the chain to within a given tolerance e > 0. We show that the number of iterations 
required for convergence to within £ grows only linearly in p and not exponentially. This is 
an encouraging result. A complete discussion of convergence rates in terms of Wasserstein 
distance can be found in Supplemental Section B. 

We also note that Roman and Hobert ( 201 ) establish a useful result concerning geometric 
ergodicity of a Gibbs sampler for the linear mixed model. Their results, however, do not 
provide a quantitative bound on the geometric convergence rate itself. Thus it is not clear 
how the geometric rate behaves as a function of the sample size n and the dimension p. I 11 
contrast our analysis obtains sharp quantitative bounds for this geometric convergence rate 
in terms of n and p for standard regression. 
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3.2 Characterization of Convergence 

The behavior of the Gibbs sampler in Subsection 3.1 in various n and p regimes can be 
further examined by considering the nature of the joint posterior distribution itself. The 
following lemma provides insight regarding the posterior correlation between cr 2 and a par¬ 
ticular function of f3. Specifically, let 6 = A l ^ 2 {(3 — /3), and note that ||0|| 2 represents a 
Mahalanobis-type distance between (3 and the posterior mean /3. Then we have the following 
result. 


Lemma 3.2. For the posterior of the standard Bayesian regression framework in (3.1), 


Corr( 


a 


\e 




P 


n + p — 2 


Thus, by Lemma 3.2, the posterior correlation of a 2 and ||0|| 2 tends to 1 asymptotically 
if p n ^ 0(n). This behavior is a consequence of the way in which the prior on (3 and cr 2 is 
specified under the Bayesian regression framework in (3.1). Specifically, observe that under 
this prior, we have 


1 

P 


ml 


o 


rs-/ 


Gamma 


(P pA A 

\2’ 2a 2 ) ’ 


a 2 | (3 


InverseGamma 



(3.3) 


Observe from (3.3) that 


e{- ml 

\P 




Var("-||/3|| 2 

\P 



2 

P 



2 


If p is large, the prior distribution of p 1 ||/3||| | cr 2 is highly concentrated around a 2 / A. 
Similarly, observe from (3.3) that 


E(a 2 | (3) 


p-2 ’ 


Var(cr 2 | (3) 


2 

p — p — 2) 


If p is large, the prior distribution of cr 2 | (3 is highly concentrated around A11/3111/(p — 2). 
Thus, for large p, the prior is highly informative about the relationship between ||/3|| 2 and a 2 . 
It can be shown that this high dependence carries over to the posterior in the regime where 
p n because the data is overwhelmed by the prior. The posterior dependence between 
the parameters manifests itself through the conditionals that are used in the Gibbs sampler. 
The value of H^IH = (/3fc — (3) T A(/3 k — (3) is heavily dependent on the value of cr 2 _n and in 
turn the value of cr^_ 1 is heavily dependent on the value of ||0^._ 1 || 2 . Thus, each iteration of 
the joint and marginal Gibbs sampling chains is highly dependent on the previous iteration. 
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The same concept may be alternatively expressed by stating that the chain mixes poorly. 
More specifically, one manifestation of this poor mixing behavior is high autocorrelation 
between successive values a\ and cr/! +1 , even if the chain is in its stationary state. This 
property is formalized in the following lemma. To state the result, let G denote the stationary 
distribution of the marginal chain for the chain in (3.2), i.e., the true marginal posterior 
of a 2 under (3.1), which is the InverseGamma(n/2, C/2) distribution. 

Lemma 3.3. Consider the standard Bayesian regression Gibbs sampler in (3.2). If ~ G, 
then for every k> 0, Corr(a|, a/ +1 ) = Corr(||0 fc |||, ||0 fe+ i|||) = p/(n + p - 2). 

Although Lemma 3.3 asserts that the norms of the 9k chain are highly dependent, their 
directions are independent. This fact is established in the following lemma. 

Lemma 3.4. Consider the standard Bayesian regression Gibbs sampler in (3.2). If Uq ~ G, 
then the vectors /1 |||2 are independent for all k > 1. 

The behavior of the standard Bayesian regression Gibbs sampler in (3.2) as described 
by Lemmas 3.3 and 3.4 can be interpreted geometrically. For any t > 0, the set {9 € R p : 
||0||| = 0 defines a hypersphere in 0-space, which corresponds to a hypercllipsoid in /3-space. 
As discussed previously, the value of || 0^- 111 i s ver y highly dependent on the value of ||0fc_i ||| 
when p n. Hence, in the p^> n regime, 0/. is likely to fall on a hypersphere very close to 
the hypersphere on which 0*._ 1 falls. It then follows that (3k and (3k-i are also likely to fall 
on hypercllipsoids that are very close together. Note that the center of the hyperspheres in 
0-space corresponds to the posterior mean (3 in /3-space. Thus, the (3k chain has difficulty 
moving to points “closer to” or “farther from” the posterior mean (3 as measured by the 
Mahalanobis distance 1101 |2 = || A 1//2 (/3 — /3)|| 2 - This behavior is illustrated in Figure 1. (It 
should be noted, however, that this behavior only arises when p is large, so an illustration 
with p = 2 should be interpreted merely as a conceptual representation of the behavior in 
question.) Meanwhile, the behavior of the marginal a\ chain as described by Lemma 3.3 
is somewhat simpler. The a/, simply exhibits a high autocorrelation, i.e., it has difficulty 
moving at all. 

In a practical sense, it is important to understand the manner in which the convergence 
problems discussed in the previous paragraph would affect inference based on Gibbs samples 
that have been ostensibly (but not actually) drawn from the approximate posterior. First, 
note that the aforementioned autocorrelation phenomenon occurs between a 2 and the norm 
of 0, not the direction of 0. Thus, even if the chain mixes slowly, the average of the 9k 
iterates should be close to the origin in 0-space. It follows that the average of the (3k iterates 
should be close to the posterior mean (3 in /3-space. Thus, even when p/{n + p — 2) is close 
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Figure 1: Geometric representation of the behavior of the stan¬ 
dard Bayesian regression Gibbs sampler in (3.2) in terms of 6 = 
A 1/2 (f3 — (3). If the point on the solid circle represents the value 
of 6 k , then the value of Ok+i falls with high probability in the shell 
of values (region between dashed circles) where || 0 fc+i ||2 is approx¬ 
imately equal to \\ 0 k \\2 (solid circle). 


to 1, the chain can still yield a good approximation of the posterior mean /3 of the regression 
coefficients /3. However, there may be substantial error when using the Gibbs sampling 
output to approximate either the posterior variance of /3 or the posterior mean of a 2 . It is 
indeed possible for the f3 k iterates to be distributed closer to the posterior mean than they 
should be, in which case approximate credible intervals based on the Gibbs sampling output 
will be too narrow. On the other hand, if the (3 k iterates are distributed too far from the 
posterior mean, then the approximate credible intervals will be too wide. Thus, uncertainty 
quantification, one of the fundamental advantages of posterior inference, may be seriously 
compromised. 

3.3 Convergence Diagnostics 

In practice, MCMC convergence behavior is often assessed through convergence diagnostics. 
These methods can be useful for identifying various kinds of convergence problems in some 
settings, though it is well understood that they do not establish convergence in any rigorous 
sense. 

Consideration of convergence diagnostics is especially important in high-dimensional set¬ 
tings because the types of convergence problems described in this subsection and the previ¬ 
ous subsection may not be detectable if only certain convergence diagnostics are considered. 
More specifically, note that the primary parameter of interest in the regression model is /3. 
As discussed in Subsection 3.2 and illustrated in Figure 1, the draws of the /3 k iterates in 
the standard Bayesian regression Gibbs sampler can be highly dependent in terms of their 
respective distances from their distribution’s center, but their directions from the center are 
independent and identically distributed. Then any convergence diagnostic that focuses on 
plotting, testing, or otherwise analyzing only linear functions of the [3 k is likely to fail to 
identify any problem. For example, a trace plot of the marginal chain for any component 
of (3 k will likely appear to have converged and to be uncorrelated. Similarly, the Geweke 
diagnostic (' feweke, 1992), which compares the means of the iterates from earlier versus later 
portions of the chain, is also likely to fail to detect any problem. 
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On the other hand, the above high-dimensional convergence problems can be detected 
by certain convergence diagnostics when applied to certain parameters. For instance, in 
standard Bayesian regression, a trace plot of the marginal chain for the nuisance parameter 
(i.e., cr|) will indeed reveal if the chain is slow to converge or is highly autocorrelated. Slow 
convergence can similarly be detected by the Geweke diagnostic as applied to the nuisance 
parameter. (These same diagnostics can also reveal the problem if they are applied to 
quadratic, rather than linear, functions of (3k since the behavior of such quadratic functions 
of (3k may indeed be similar to that of af, noting the result of Lemma 3.3 ). Hence, the 
essence of the message from the above analysis is that it is important to examine convergence 
diagnostics for all parameters, not merely the parameter of interest, if high-dimensional 
convergence problems are to be detected. Moreover, certain other convergence diagnostics 
exist that may be more readily able to detect problematic phenomena when they do occur. 
For the regression setting, diagnostics based on the relative variability within and between 
various (3k chains may indeed be useful. Examples in this regard include the Gelman-Rubin 
diagnostic ( lelinan and Rubin, 1992). Such an approach may not be feasible under the 
limited computational budget that is often present in high-dimensional settings. 


3.4 Convergence Rates for Graphical Models 

The results obtained for the Gibbs sampler for the standard Bayesian regression framework 
can also be applied in the context of a Bayesian analysis of a class of Gaussian graphical 
models. We first define some notation. Let Q = (V, E) be a directed acyclic graph (DAG) 
with vertex set V — {1,..., m } and edge set E C V x V. Assume that i > j for all (z, j ) e E, 
i.e., assume that Q is parent-ordered. Let 


pa(j') = {i eV : ( i,j ) e E}, fa (j) = pa (j) U {j}, 


denote the parents and family of vertex j. Let 5j = | pa(ji) denote the cardinality of the set 
of parents of vertex j, which we shall call the degree of vertex j. Now consider the Gaussian 
DAG model in Cholesky form 


X\i ..., X n | D, L ~ iid N„ 


0 , 


L-^DL ' 1 


(3.4) 
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where D = Diag(cr^ pa / 1 ^,..., cr^| p , •)), and where the elements of L are 




1 if i = j, 

< -Pij if i G pa (j), 
0 if i ^ fa(j). 


Now suppose we take the prior on D and L to be (D, L ) ~ iru,a, that is, the DAG-Wishart 
prior as defined by Ben-David et al. ( 015). Combining this prior with the DAG model 
in (3.4) yields the Bayesian DAG framework. The posterior distribution of (D,L) then 
factorizes as 


-k(D,L \X u .. .,X n ) = n I> P a.{j),j | Xi, ..., X n ), 

3 = 1 


i.e., the posterior distributions of (Djj, L pa ,(j)j) are mutually independent for each j G 
{1,... ,m} (Ben-David et ah, 2015). Then we can execute separate Gibbs samplers for 
each of these posterior distributions for each j G {1, and combine them to yield 

samples from the overall joint posterior of (D, L). 

To state the form of these Gibbs samplers, we first define an additional item of notation. 
For any m x m matrix H and any two index subsets A,BO {1,..., m}, we write Ha,b to 
denote the submatrix of H formed by retaining the jth row if and only if j G A and the 
kth column if and only if k G B. (Note that if A or B is a singleton set, i.e., A = {a} or 
B = {fe}, then we will write simply H o Rl H^b, or H ab .) 

Now suppose that for each j G {1,..., m}, we set an initial value Djj ;0 > 0. Then a 
Gibbs sampler for drawing from the posterior of (Djj, L p ^j)j) takes the form 


-^'pa(.7')j;fc Bj + \/Djj■ fc-i Wj Zj,k> 


Djj:k 


V ., 


j,k . 


Wj ^ (-C'pa {j),j;k Bj) 


+ Ci 


where Z jik ~ N Sj (0 5j , Isj), 
where V^ k ~ X^ +Q ._ 2 , (3.5) 


where Wj — Chpa(j),pa(j) + ^'S’pa(j),pa(j), Bj — ~Wj {Upa.(j),j-knS pa .(j')j), and Cj — Ujj -\-nSjj 
nJWj^ij, and where all of the Z ]k and Vj )k are independent. (See Ben-David et ah, 2015, 
for the form of the relevant conditional distributions.) 

Observe that if we take ay = S 3 + 2, then this Gibbs sampler has the same form as the 
standard Bayesian regression Gibbs sampler in (3.2) with p = Sj. We can therefore use our 
convergence results for standard Bayesian regression to obtain convergence results in the 
Bayesian DAG framework. To state and prove the said result, let Ej t k(Djj . 0 ) denote the 
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distribution of (Djj. k , Lpn(j),j-,k) for the jth chain initialized at Djj . 0 , and let S j denote the 
corresponding stationary distribution for j G {1,..., m}. Then let E k (D 0 ) denote the distri¬ 
bution of (Dk, Lk ) for the overall joint Gibbs sampler, and let S denote the corresponding 
stationary distribution. Also let <5 max = ma Xi<j< m 6j. The following result now gives sharp 
bounds for the convergence rate of the DAG-Wishart Gibbs sampler. 

Theorem 3.5. For the Bayesian DAG Gibbs sampler in (3.5), there exist 0 < M\ < M 2 
such that 


M x 


n + 5 n 


k 

< d T y[E k (D 0 ), 5] 


< m 2 


n + 5 n 


k 


for all sufficiently large k. 


Thus, the geometric rate constant of the Bayesian DAG Gibbs sampler in (3.5) is bounded 
away from 1 asm and n tend to infinity if and only if <5 max = O(n), i.e., if and only if the 
maximum degree of any vertex grows no faster than the sample size. Thus, the convergence 
complexity of the Bayesian DAG Gibbs sampler is closely related to its sparsity. This result 
provides yet another motivation for desiring sparsity in modern high-dimensional settings. 


4 Bayesian Model Selection 

We now turn our attention to Gibbs samplers for Bayesian model selection. Important 
contemporary cases include the Bayesian lasso of Park and Casella (2008) and the spike- 
and-slab prior of litchell and Beauchamp ( 988). The form of the priors we consider below 
is general enough to accommodate other “regularized” Bayesian approaches to regression as 
well. They are also easily extended to model selection in other statistical models. 

The Bayesian analysis of the standard regression model in (3.1) can be generalized by 
replacing the prior on (3 \ a 2 with a scale mixture of normal distributions, i.e., by taking 

Y\f3,a 2 ~N n (Xf3, a 2 I n ), 

P I cr 2 ,r ~ iVp(O p , a 2 D T ), (4.1) 

7r(cr 2 ) oc 1 /cr 2 , 
r ~ 7r(r), 

where r is a p-dimensional vector of positive hyperparameters and D T = Diag(r 1; ... ,r p ). 
A variety of priors for (3 j a 2 can be represented by the hierarchical construction in (4.1) 
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above, as will be discussed in Subsection 4.1 below. We will use the term Bayesian model 
selection framework to refer in general to the model and priors in (4.1) above. 

Now suppose that we can sample from the conditional posterior tt (t | /3,a 2 ,Y) for all 
(3 eW and all a 2 > 0, as is often the case. (See Subsection 4.1 for examples.) Then a Gibbs 
sampler to draw from the joint posterior under (4.1) may be constructed by taking initial 
values /3o £ M p and Uq > 0 and then drawing (for every k > 1) 

r k | p^al^Y ~ tt(t | (3 = i, a 2 = a\_ x , W), 

Pk | °\-x, Tk, Y ~ N p (p Tk , , (4.2) 

°h I Afc, Tk, Y ~ InverseGamma 


n + p 


W(A-/3 


Tk 


+ c. 


Tk 


where A T = X 1 X + D T l (which is positive-definite), (3 T = A T 2 X 7 Y , and C T = Y T (I n — 

X A~ L X T )Y. 


4.1 Special Cases for Model Selection: Bayesian Lasso, Bayesian 
Elastic Net, & Spike-and-Slab 

Suppose that r 1; ... ,r p are assigned independent Exp(A/2) priors, where A > 0. Then the 
resulting marginal prior on the regression coefficients /3 | cr 2 is a product of Laplacian (double 
exponential) distributions: 

i j2 )=n \ v^ exp (- \f^ iftij ■ 

The conditional posterior of r is then 


To 


A a 2 


(3,a\Y ~ ind. InverseGaussian ,.. 

VV P 


A 


This particular hierarchical representation is the original formulation of what is typically 
called the Bayesian lasso (Park and Casella, 2008). 

Suppose instead that T\,... ,r p are assigned the prior 


7r(ri,...,r p ) = 


Ai 


2(1 - A 


exp 


Al Tj 


2(1 - A 2Tj ) 


1(0, i/a 2 )(t j ), 
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where Ai, A 2 > 0. Then the resulting marginal prior on the regression coefficients (3 \ a 2 has 
the form 


p 

7 t((3 I cr 2 ) oc exp 
3 =1 



A2 

2^2 



The conditional posterior of r is then 


-Ac 


Tn 


ft, cr 2 , Y ~ ind. InverseGaussian 


1 Aicr 2 

w 


Ai 


This particular hierarchical representation is known as the Bayesian elastic net ( Li and Lin, 
2010 ; Ivyung et ah, 2010). 

As another example, suppose instead that the priors on T \,..., r p are again taken to be 
independent, but for all j G {1,... ,p}, take P{r 3 = KjCj) = w 3 = 1 — P{t 3 = Q), where 
(j > 0 is small, k 3 > 0 is large, and 0 < w 3 < 1. This is a slight variant of the prior proposed 
by George and McCulloch (1993) to approximate the spike-and-slab prior of Mitchell and 
( ). The prior on ft is specified conditionally on cr 2 , with Var(/3 | cr 2 , r) oc cr 2 . 

Then r 3 \ (3,cr 2 ,Y are conditionally independent a posteriori with 


P(t 3 = KjCj | ft, cr 2 , Y) = 1 - P(t 3 = Q | ft, a 2 , Y) 

L , ( 1 -^)^% 

= <; 1 H--— exp 


ft 2 ( n; — 1 


-1 


2cr 2 y KjQ 

by straightforward modihcation of the results of leorge and McCulloch (1993). 


4.2 Convergence Properties 

The Gibbs sampler in (4.2) for Bayesian model selection is easily executed in practice. In 
comparison to standard regression, the additional step in the Gibbs sampling cycle makes it 
less tractable in the context of analyzing convergence rates in various n and p regimes. Never¬ 
theless, geometric ergodicity has been obtained for important special cases in Subsection 4.1. 
The Gibbs sampler for the modified spike-and-slab model was shown by Diebolt and Robert 
(1990) to be geometrically ergodic, but without quantitative bounds on the geometric conver¬ 
gence rate. For the Bayesian lasso Gibbs sampler, Khare and Hobert (2013) used the method 
of Rosentha ( 995) to establish geometric ergodicity and derive a quantitative bound f on the 
geometric convergence rate r. In Example 2.3 and Lemma 2.4, we showed that this bound r 
tends to 1 exponentially fast as either p or n tends to infinity. As this result is an upper 
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bound based on Rosenthal’s method, it is not clear whether it is sharp in high-dimensional 
regimes. Thus, it does not answer the question of the chain’s actual convergence rate or the 
rate at which it may tend to 1 as n or p grows without bound. To address this question 
for Gibbs samplers for the Bayesian lasso, Bayesian elastic net, spike-and-slab priors, and 
other Bayesian model selection frameworks, we now provide an autocorrelation result that 
is similar to that of Lemma 3.3. 

Theorem 4.1. Consider the Gibbs sampler in (4-2) for Bayesian model selection. Suppose 
that (/3 k ,a k ) ~ 7r(/3,cx 2 | Y). Then 


Corr (a 2 k , a 2 k+l ) > 


P 


n + p — 2 


1 - 


Y t Y 


p A/Var(a 2 | Y) 


Suppose we make the mild assumption that ||YJj||| = 0(n), and suppose also that Var(cr 2 | 
Y) = 0(l/n), as is commonly the case. Then it is clear that the lower bound in Theorem 4.1 
tends to 1 in the limit as p n /n 3//2 —» oo. Hence, the MCMC convergence problems seen 
in Section 3 for the standard regression model occur once again for the Gibbs sampler 
for Bayesian model selection if p n grows too fast relative to n. This phenomenon is of 
course concerning as the lasso is specifically designed for high-dimensional settings where 
p n. The sharpness of the above lower bound for the autocorrelation is further investigated 
numerically in Subsection 4.3 below. 


4.3 Numerical Results for Bayesian Model Selection 

Theorem 4.1 provides a lower bound for the autocorrelation between successive iterates of 
the a 2 chain of the Gibbs sampler in (4.2) for Bayesian model selection. It is not immediately 
clear whether this bound is sharp, so it is also instructive to use numerical approaches to 
understand the high-dimensional convergence behavior of Gibbs samplers of this form. The 
left side of Figure 2 plots the autocorrelation in the cr 2 chain versus p/(n + p — 2) for various 
values of n 6 {10, 30,100} and p E {10, 30,100} as observed from runs of the Gibbs sampler 
for the Bayesian lasso. The center and right side of Figure 2 are similar plots for the Gibbs 
samplers for the Bayesian elastic net and the spike-and-slab prior (respectively). (The exact 
details of these runs can be found in Supplemental Section H.) Such plots can be useful tools 
when sharp theoretical bounds are not available. The strength of the linear relationship in 
Figure 2 strongly suggests that the ratio pj (n + p — 2) governs the convergence behavior of 
the Gibbs samplers for a variety of Bayesian regression approaches that can be written in 
the form specified by (4.1). Thus, although the theoretical result of Theorem 4.1 is slightly 
less refined than those obtained for the standard regression model in Section 3, it is clear 
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from Figure 2 that these Markov chains exhibit the same convergence complexity as in the 
standard regression setting. 


Oo 
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Figure 2: Autocorrelation of the chain versus pj{n + p — 2) for the Gibbs sampler for 
the Bayesian lasso (left), Bayesian elastic net (center), and the spike-and-slab prior (right). 
See Supplemental Section H for details of the generation of the various numerical quantities, 
vectors, and matrices that were used in the execution of these chains. 


In summary, our theoretical and numerical analysis above indicates that regardless of the 
type or form of regression (standard regression, lasso, elastic net, or spike-and-slab), there 
is a universal geometric convergence rate of the form r = p/(n + p — 2). 


5 Multivariate Location Models 

A conceptually simple but centrally important class of models is the class of multivariate 
mean or location models. We now consider the convergence complexity of Markov chains 
associated with a Bayesian analysis of such models. We shall see that though there are 
some similarities between location models and regression, convergence complexities are vastly 
different. 

Let Xi,... ,X n be observed data vectors taking values in M p , and consider the multi¬ 
variate mean model and priors 

Xi | pi, a 2 ~ iid N p (n, a %), 

/i | cr 2 ~ iid Ap(0p, A -1 cr 2 J p ), (5.1) 

7r(<7 2 ) oc 1/cr 2 for all a 2 > 0, 

where i G {1 ,,n} with n > 3, and where A > 0 is known. Then a Gibbs sampler to draw 
from the joint posterior under (5.1) may be constructed by taking an initial value o\ > 0 
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and then drawing (for every k > 1) 



a\ | /x fc , X \,.... X n ~ InverseGamma 


np + p (n + \)\\nk~ v\\l + C 


(5.2) 


2 


2 


where fi = {n + A) 1 Y^=i X i anci C = Ya= i II-AG II| - (1 + n 1 A)||^||| 


5.1 Convergence Properties 

The convergence properties of the Gibbs sampler in (5.2) for the multivariate mean model can 
be obtained using the results previously established in Section 3 for the standard regression 


Gibbs sampler in (3.2). For every k > 0, let Ffc(o'o) denote the joint distribution of (/^aj?) 
for the Gibbs sampler of the multivariate mean model in (5.2) started with initial value <7 q. 
Let F denote the stationary distribution of this chain, i.e., the true joint posterior of (/x, a 2 ). 


Then we have the following result. 

Theorem 5.1. Consider the Gibbs sampler for the multivariate mean model in (5.2). Then 
there exist 0 < Mi < M 2 such that 



for every k > 0. 

Despite the apparent similarities between the standard regression model and the mul¬ 
tivariate location model, it is clear from Theorem 5.1 that the respective Gibbs samplers 
display different convergence complexities. In particular, as p —> oo with n fixed, the ge¬ 
ometric convergence rate of the standard Bayesian regression Gibbs sampler tends to 1. 
However, the convergence rate of the Gibbs sampler for the multivariate mean model tends 
to l/(n + 1). Moreover, r < 1 /(?z — 1) for any p. Thus, the convergence rate of the Gibbs 
sampler for the multivariate mean model is bounded away from 1 in all n and p regimes. 

As was the case in Bayesian regression, we can again establish sharp results in terms of 
Wasserstein distance. These results can be found in Supplemental Section D. 
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6 Normal Hierarchical Model 


Hierarchical models are an important class of models that have found widespread applications 
in many fields. They have thus become a staple in contemporary Bayesian inference. Their 
flexibility and ability to avoid overfitting makes them ideal for modern high-dimensional 
settings. Hierarchical models are also ideally suited for Bayesian analysis since they are 
readily amenable to posterior inference using Gibbs samplers. To further investigate notions 
of convergence complexity, we thus turn our attention to Markov chains associated with a 
Bayesian analysis of a common type of model: the normal hierarchical model. We begin 
by first considering a simplified version of such a model in which the variance components 
are known. We subsequently investigate the unknown-variance version of this hierarchical 
model as well. 

Let X|,.... X„ be observed data vectors taking values in M p , and consider the following 
hierarchical model and priors: 


X t | i/h ~ inch Np^i, a 2 I p ), 

i\)i | /x ~ iid N p (v,t 2 I p ), (6.1) 

7r(/_£,) oc 1 for all /x G M p , 


where i G {1,..., n} and where a 2 > 0 and r 2 > 0 are known. Then a Gibbs sampler to draw 
from the joint posterior under (6.1) may be constructed by taking an initial value /x 0 G 
and then drawing (for every k > 1) 


ipk,% | Mfc-i, X n ~ ind. N p [(1 - r)X\ + r/x fc _ L , r 2 rl p ], 

1 


2=1 


Mfc I ^Pk, 1 ; • • • > ^Pk,m • • • , X n ~ Np | ^ ^ ^ ^ Ip J ■ 

for each i G {1,... ,n}, where r = cr 2 /(cr 2 + t 2 ). 


( 6 . 2 ) 


6.1 Convergence Properties 

We now establish sharp bounds for the geometric convergence rate of the Gibbs sampler 
in (6.2) for the normal hierarchical model. For every k > 0, let Hk(^o) denote the distribution 
of for the normal hierarchical model Gibbs sampler in (6.2) started with initial value /io, 
and let H denote the true marginal posterior of /x. Then we have the following result. 
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Theorem 6.1. Consider the Gibbs sampler for the normal hierarchical model in (6.2). Then 


/ Ti 

a 2 + r2 H^0ll2 rA: 

for all sufficiently large k, where r = a 2 /(a 2 + r 2 ). 

Note in particular that Theorem 6.1 provides an expression for the geometric convergence 
rate r that does not depend on n or p. Thus, the Gibbs sampler specified in (6.2) for the 
normal hierarchical model does not exhibit the same high-dimensional convergence problems 
that were seen in Sections 3 and 4 for the regression model and its extensions. More precisely, 
the geometric convergence rate r n p of the Gibbs sampler for the normal hierarchical model is 
(trivially) bounded away from 1. In this respect, the convergence complexity of the normal 
hierarchical model is similar to that of the location model in Section 5. (Note also that this 
result shows that the Gibbs sampler converges faster when the population variance r 2 takes 
larger values. ) 

Similarly to the autocorrelation result in Section 3, it can be shown that if Ho ~ H (i.e., 
if the chain is stationary), then 

d2 

Corr(/ifc j,/Xfc+i j) = 9 (6.3) 

cr" + r z 

for each j £ {1,... ,p}. The autocorrelation result in (6.3) above contrasts with the auto¬ 
correlation result for standard regression in Lemma 3.3 in the same way that the conver¬ 
gence rate in Theorem 6.1 constrasts with the convergence rate for standard regression in 
Theorem 3.1. 

We can once again establish sharp results in terms of Wasserstein distance as well. These 
results can be found in Supplemental Section E. 

6.2 Unknown Variances &; Convergence Rates 

The normal hierarchical model in (6.1), in which the variances are known, is simple enough 
to yield a Gibbs sampler that permits the derivation of sharp bounds for the geometric 
convergence rate. It is also of interest to consider a more complex model in which the 


n 


2 (cr 2 + t 2 ) 
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variances are unknown. Thus, suppose we have 


Xi | ipi.a 2 ~ ind. N p (^i,a 2 I p ), 
tyi | M, t 2 ~ iicl N p (n, t 2 I p ), 

a 2 ~ InverseGamma(a cr /2, s a / 2), (6.4) 

t 2 ~ InverseGamma(a T /2, s T / 2), 

7r(/x) oc 1 for all /i G M p , 


where a T ^ > 0 are all known. The posterior for the above setup is less tractable, 

and MCMC is indeed required to sample from the posterior. A Gibbs sampler to draw from 
this posterior takes initial values /x 0 G R p and <7 q,Tq >0 and then draws (for every k > 1) 


VhM I Atfc - i , 0 fc _: L > T < fc -~ ind - 


(1 — Pk-l)Xi + Pk-ll^k-l, 


n - 2 T 2 
°k-l‘k-l j 

a 2 + r 2 p 
°fc-i + r fc-i . 




V’ft.i," -. °i- 1, r »-i. v ~ I - i/v.. A 1 /, ), 
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a 2 k I 1,..., */>*,„, Aifc, Tt—i, ~ InverseGamma 


r fc 2 | ..., i/h:,n, li k ,crl,X ~ InverseGamma 


a^P-np 1 
2 ’ 2 * S ° 

a T + np 1 

2 ’ 2 1 St 


^ ^ | X.j 'iftk, 
2=1 
n 

5^11 ^k,i ~ Mfc 


012 


2=1 


for each i G {1, ..., n}, where p fc _i = cr^/(a|_ 1 + r 2 _,) and X = (X 1; .. . , X n ). 

Since sharp theoretical results for the convergence rate of the above Gibbs sampler are not 
as readily quantifiable, we now provide a numerical demonstration of the behavior of these 
chains in relation to n and p. Figure 3 plots the autocorrelation in the r| chain for various 
values of n G {10,30,60,100,150,210} and p G {3,10,30,100,300} for the Gibbs sampler 
in (6.5) for the unknown-variance normal hierarchical model. (The exact details of these 
runs can be found in Supplemental Section H.) Figure 3 also plots the same information as a 
three-dimensional autocorrelation surface and an autocorrelation contour plot. These plots, 
which we call dimensional autocorrelation function (DACF) plots, depict the autocorrelation 
as a function of the sample size n and the dimension p. It is clear from Figure 3 that the 
convergence behavior of the chain corresponding to the unknown-variance hierarchical model 
is dramatically different from the known-variance case in high dimensions. In fact, both 
large sample sizes and high dimensions affect the autocorrelation adversely. The seemingly 
harmless practice of putting a prior on a clifficult-to-specify quantity in fact leads to a chain 
that converges slowly for large n or p. Thus, in the unknown-variance setting, it appears 
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that a sample-starved high-dimensional hierarchical model enjoys better convergence than 
a sample-rich high-dimensional hierarchical model. In this sense the hierarchical model is 
better suited to “large p, small n” applications thanto “large p. large n" applications. 




Figure 3: Autocorrelation of the r| chain versus the sample size n (left) for various values of p. 
Dimensional autocorrelation function (DACF) plots in both surface (center) and contour 
(right) forms for the r| chain relative to n and p for the unknown-variance normal hierarchical 
model. See Supplemental Section H for details of the generation of the various numerical 
quantities, vectors, and matrices that were used in the execution of these chains. 


6.3 Unknown Variances &; Bounded Convergence Rates 

Empirical Bayesian methods provide one straightforward way to obtain bounded convergence 
rates for the normal hierarchical model with unknown variances. If the values of a 2 and r 2 
are set by an empirical Bayesian approach (e.g., by taking the values that maximize the 
marginal likelihood), then these values d| B and f| B may simply be inserted into the Gibbs 
sampler for the known-variance model. It then follows immediately from Theorem 6.1 that 
the convergence rate is <r| B / (d| B + f| B ). 

Even when variances are unknown, we now show that it is still possible to achieve bounded 
convergence rates for the normal hierarchical model. Some insight into a possible solu¬ 
tion may be gained by observing that a known-variance approach is simply the limit of an 
unknown-variance approach as the priors on the variances tend to degeneracy at particu¬ 
lar points (the “known” values). More precisely, suppose that we retain the independent 
inverse-gamma priors for a 2 and t 2 as specified in (6.4), but suppose we take a a , s a , a T , 
and s T to grow proportionally to np. Figure 4 shows plots analogous to Figure 3 in which 
we have taken a a = s a = a T = s T = np, i.e., dimensionally-dependent. It is clear from 
Figure 4 that the convergence rate remains bounded away from 1 for all n and p. Thus, 
the dimensionally-dependent prior for the unknown variances yields dramatically improved 
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convergence complexity. We will revisit this idea in Subsection 7.3 for the regression setting 
as well. 



n 


Figure 4: Autocorrelation of the r| chain versus the sample size n (left) for various values 
of p. Dimensional autocorrelation function (DACF) surface plot for the r| chain relative 
to n and p (right) for the unknown-variance normal hierarchical model. Both plots take 
a a = s a = a T = s T = np. All other settings are the same as in Figure 3. 


ft is also possible to use the empirical Bayesian approach to set the values of the points 
to which the aforementioned dimensionally-dependent priors converge. Such a hybrid ap¬ 
proach would enjoy bounded convergence rates while both allowing the variances to remain 
stochastic and permitting sensible choices of the corresponding hyperparameters. 


7 Bounded Geometric Convergence Rates for High- 
Dimensional Regression 

Recall that as discussed in Subsection 2.4, many applications of the method of Rosenthal 
(199 ) yield an upper bound for the geometric convergence rate that tends to 1 as the di¬ 
mension p tends to infinity. It was demonstrated in Sections 3 and 4 that in the important 
regression setting, the actual convergence rate (as opposed to merely a bound) tends to 1 
if the dimension p grows faster than the sample size n. Thus, MCMC-based inference for 
regression when n = o(p) (i.e., in modern high-dimensional settings) remains a critical hur¬ 
dle. On the other hand, Sections 5 and 6 demonstrated that in important classes of models 
like location models and hierarchical models, Gibbs sampling-type MCMC enjoys bounded 
convergence rates. Then it may be asked (i) whether bounded convergence rates can never¬ 
theless be attained for regression models, and (ii) whether such bounded convergence rates 
can be rigorously established by the method of hosenthal (1995). In this section, we present 
two approaches to address these issues in the regression setting. First, we propose a con- 
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Crete framework in which Rosenthal’s method can still be used to obtain bounds on the 
convergence rate that do not tend to 1 as p —> oo. We apply this technique to a regression 
model with independent priors on (3 and cr 2 to obtain the aforementioned bounded conver¬ 
gence rate. Second, we propose an alternative, dimensionally-dependent prior specification 
that immediately yields provable bounded convergence rates while still retaining the classical 
conditional prior specification. Thus, we show that these approaches yield the theoretical 
safeguard of geometric ergodicity so that MCMCs are still effective as a means to sample 
from modern high-dimensional posteriors, that is, even if n — o(p). 

Consider a Gibbs sampler for drawing from a posterior distribution n(9,<p \ Z), where 
9 is low-dimensional (say, 6 6 I) but cp is high-dimensional (say, cp € M p ), and where Z 
denotes the data. A two-step Gibbs sampler proceeds by drawing alternately from n(Q \ <p, Z) 
and 7r(0 | 9 , Z). Then the Markov transition density for drawing the next point based on 
the previous point (9 0 ,(p 0 ) has the form f(9,<p \ 9o,<po) = fi(9 \ 4>o) f'lifp \ 9), where for 
simplicity we suppress the dependence on Z in the notation. Now suppose we wish to prove 
a minorization condition in order to apply the result of losenthal ( .991 ). Then it suffices to 
find a density g(9, cp) and e > 0 such that f(9, <p \ 9 0 , cp 0 ) > £ g(9, cp) for all (6* 0 , (p) in some 
small set. Observe that g(9, cp) may be constructed by first finding a density g\{9) and e > 0 
such that 


fi{9 I 0o) >egi{9) (7.1) 

for all (p 0 in some small set, and then defining g(9,<p ) = g\(9) f 2 (4> \ 9). Thus, if the 
high-dimensional parameter is drawn in the last step of the Gibbs sampling cycle, then a 
minorization condition can be established by working only with the low-dimensional distri¬ 
bution of the other parameter. More precisely, the quantity e > 0 that appears in (7.1) 
is used to bound a low-dimensional distribution by another low-dimensional distribution. 
Hence, the convergence issue discussed in Subsection 2.4, in which the quantity e takes the 
form e = (£*) p , is thereby avoided. Note that the same principle applies for Gibbs samplers 
of more than two steps as long as the high-dimensional parameter is confined to the last step 
of the Gibbs sampling cycle. We illustrate the general approach above in the next subsection. 

7.1 Independent-Prior Regression Model 

As an example of the above technique, consider a modification of the standard Bayesian 
regression framework in (3.1) in which the joint prior on /3 and cr 2 is specified as independent, 


i.e., 7 r(/3,cr 2 ) = 7 r(/3) 7r(cr 2 ), that is, it is not specified conditionally: 


Y |/3,a 2 ~iV n (X/3,a 2 / n ), 

/3^7V p ( 0p ,A-%), (7.2) 

a 2 ~ InverseGamma(a/2, s/2), 


where X is a known n x p matrix (again with n > 5), and where the hyperparameters have 
known values A > 0, a > 2, and s > 0. Note that the improper prior 7r(cr 2 ) oc 1 /cr 2 is not 
used in the above formulation since it leads to an improper posterior when p > n. Then 
a Gibbs sampler to draw from the posterior under (7.2) may be constructed by taking an 
initial value (3q G MX and then drawing (for every k > 1) 


I Pk-i,Y 


InverseGamma ( ———, 

V 2 


\Y~Xf3k, 


I 2 + s 


fik 


<Y 


N , 


(a 


°k A \ 


(7.3) 


where A a 2 = X 7 X + Xcr 2 I p and (3 a 2 = A (j lX T Y. 


7.2 Convergence Properties &; Bounded Convergence Rates 

We now derive a quantitative upper bound for the convergence rate of the the independent- 
prior regression Gibbs sampler in (7.3). To do so, we use the aforementioned approach that 
allows us to focus on the distribution of the low-dimensional parameter cr 2 when establishing 
the minorization condition of Rosenthal ( L99 ). For every k > 0, let F/.(/3o) denote the 
distribution of (cr 2 , (3k) for the the independent-prior regression Gibbs sampler in (7.3) started 
with initial value (3q, and let F denote the stationary distribution of this chain, i.e., the true 
marginal posterior of (cr 2 , /3). Then we have the result shown below. ( We will preserve the 
notation of losenthal ( 1995) as closely as possible with a subscript R added, e.g., Xr is the 
quantity called simply A by Rosenthal and is unrelated to the quantity we have called A 
elsewhere in the paper.) 

Theorem 7.1. For any 0 < a < 1 and any dR > 26r/(1 — A r), 
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for every k > 1, where 



s 


(n+a)/2 


dft + S 


In particular, the constants A r, bn, cLr, and £r are functionally independent of p. 

Note that the quantities governing the convergence rate in Theorem 7.1 do not depend 
on the design matrix X or on the parameter dimension p. Thus, for any given fixed sample 
size n, the convergence complexity of the Markov chain is not affected by taking p —> oo. 
This result is stated formally in the corollary below. Let n be fixed, and suppose we have a 
sequence of n x p covariate matrices X p and a sequence of p x 1 vectors /3o :P . Let F p ^(/3o, P ) 
denote the distribution of (cr^,/3fc) for the the independent-prior regression Gibbs sampler 
in (7.3) with initial value /3o, P , and let F p denote the stationary distribution of this chain. 
Then we have the following result. 

Corollary 7.2. For the independent-prior regression Gibbs sampler in (7-3), there exist 
m (X p {3o tP ) > 0 and 0 < r < 1 such that 


dTv[^p,fc(A>), Fp] < m(X p (3 0tP )r k 


for all k andp. In particular, the geometric rate constant r is functionally independent ofp. 

By Corollary 7.2 above, there exists a single geometric rate constant r that holds for 
all p. Moreover, note that if the sequence of n x 1 vectors X p /3o yP is bounded uniformly in p 
(as would be the case for the starting point /3 0tP = 0 P , for example), then the multiplicative 
factor m(X p /3o tP ) in Corollary 7.2 is also bounded uniformly in p. 

Remark. Recall from Section 3 that the convergence rate of the Gibbs sampler for standard 
regression tends to 1 as p/n —> oo. The Gibbs sampler for independent-prior regression is 
therefore fundamentally better in this “large p, small n" regime. In particular, the Gibbs 
sampler above has two important consequences: (i) it yields a proof of concept in which 
we can prove a bounded convergence rate using the method of Rosenthal (1995), and (ii) it 
gives an example that establishes prior specification as a possible solution to convergence 
problems in high dimensions (at least when aiming to prove bounded geometric convergence 
rates). 

7.3 Dimensionally-Dependent Prior Specification 

The above independent-prior analysis leads to the question of whether bounded convergence 
rates can be obtained while retaining the conditional-prior specification. In this subsection 


30 





we show that this is indeed the case. In particular, suppose we take the model and priors as 
follows: 


Y \ (3, a 2 ~ N n (X/3,a 2 I n ), 

P | ex 2 ~ N P (0 P , A^V 2 / P ), 

a 2 ~ InverseGamma([pe]/2, s/2), 


(7.4) 


where [ •] denotes the ceiling function, X is a known n x p matrix (again with n > 5), 
and the hyperparameters have known values A > 0, e > 2, and s > 0. Then a Gibbs 
sampler to draw from the joint posterior under (7.4) may be constructed by taking an initial 
value (Tq > 0 and then drawing (for every k > 1) 


Pk I ol- i,Y ~ Nj 


(p, o\_ x A 1 


I Pk,Y ~ InverseGamma< 


n + p + \pe\ 1 
2 ’ 2 


Pk ~~ P) A l Pk — p) + C + s 


, (7.5) 


where A = X 1 X + AJ P (which is positive-definite), P = A 1 X 1 Y, and C = (I n — 

XA~ 1 X T )Y. 

The convergence properties of the Gibbs sampler above can be obtained using the results 
previously established in Section 3 for the standard regression Gibbs sampler in (3.2) since 
their basic form is the same. For every k > 0, let Fpap) denote the joint distribution of 
( Pk , &D for the Gibbs sampler in (7.5) for regression under the dimensionally-dependent 
prior started with initial value ctq. Let F denote the stationary distribution of this chain, 
i.e., the true joint posterior of (P, cr 2 ). Then we have the following result. 

Theorem 7.3. Consider the Gibbs sampler in (7.5) for the regression model under the 
dimensionally-dependent prior. Then there exist 0 < Mi < M 2 such that 


Mi 


P 


n + p + \pe\ — 2 


< dry \Fi- (cr/), F] < M 2 


P 


n + p + \pe | — 2 


for every k > 0. Moreover, the geometric rate constant r — p/(n + p + \pe\ — 2) is bounded 
above by 1/(1 + e:) for all n and p. 

Thus, the dimensionally-dependent prior specification in (7.4) provides an alternative 
approach to obtaining bounded convergence rates that preserves the conditional prior spec¬ 
ification in which Var (P \ cr 2 ) oc a 2 . 

We now show that the dimensionally-dependent prior specification in (7.4) can also be 
used where it is very relevant, that is, in high-dimensional Bayesian model selection (see 
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Section 4). Figure 5 shows analogous results of Gibbs sampling runs as in Figure 2 for the 
Bayesian lasso (left), Bayesian elastic net (center), and spike-and-slab prior (right). Here the 
prior on a 2 has been taken as InverseGamma(p/2, 1/2), with all other settings the same as in 
Figure 2. Remarkably, the dimensionally-dependent prior specification does yield bounded 
autocorrelations, thus yielding a workable solution for the use of model selection priors in 
high dimensions. 
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Figure 5: Autocorrelation of the a“l chain versus p/(n + 2p — 2) for the Gibbs sampler for 
the Bayesian lasso (left), Bayesian elastic net (center), and the spike-and-slab prior (right) 
under the dimensionally-dependent prior specification where a 2 ~ InverseGamma(p/2, 1/2). 
All other settings are the same as in Figure 2. 

Note that the above analysis shows that in principle one could choose the degrees of 
freedom in the prior specification of a 2 in order to obtain a desired geometric convergence 
rate. Hence, the prior can be specified in such a way that that the resulting Markov chain 
achieves convergence to within a given tolerance £ in a desired number of iterations. 

8 Discussion and Conclusions 

The preceding sections presented results on the convergence properties of various Gibbs 
samplers in high-dimensional regimes for important classes of statistical models. Although 
in some cases we consider simplified model and prior combinations, convergence rates can be 
extended to more sophisticated models, as demonstrated in Section 4. We now summarize 
the results in the paper and discuss their implications for high-dimensional MCMC and 
Bayesian inference in both theory and practice. 
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8.1 Summary of Results on Convergence Rates 

Sections 3, 4, 5, 6, and 7 considered the convergence complexity of the Gibbs samplers for 
several key models. The results are summarized in the table in Supplemental Section G. 
In particular, there are three important conclusions that can be drawn from our analysis 
in this paper. First, many MCMC schemes for popular models enjoy bounded geometric 
convergence rates. This property gives safeguards regarding the effectiveness of using stan¬ 
dard MCMCs in modern high-dimensional settings and is a welcome message. Important 
examples include multivariate mean models, hierarchical models with known variances, re¬ 
gression models when p = 0(n), and graphical models with bounded vertex degree. Thus, 
by and large, and contrary to what is generally perceived, convergence of high-dimensional 
MCMCs is achieved in many models, and even when not, there are possible solutions. In¬ 
deed, even in problematic cases, we have been able to resolve the convergence issue. Second, 
the Gibbs samplers for Bayesian analysis of some commonly used models have a convergence 
rate that can be arbitrarily close to 1 in high-dimensional regimes. An important case of 
this phenomenon is the class of regression-type models when n = o(p) and when the usual 
(conditional) prior specification is used. Third, the convergence complexity of the Gibbs 
sampler corresponding to a particular model can differ substantially from that of a similar 
one, i.e., slight changes to the model or prior can lead to very different convergence behavior. 
A case in point is the normal hierarchical model, in which the known-variance version enjoys 
convergence rates bounded away from 1 while the unknown-variance version does not. 

Though the mechanics of the convergence behavior of these chains are difficult to predict 
beforehand, there are nevertheless some patterns that can be observed. Models often feature 
one or more nuisance parameters that tie together a large number of other parameters. 
Typical Bayesian practice would often be to take such parameters as unknown with some 
uninformative prior. However, in high dimensions, such an approach can lead to extreme 
posterior dependence between parameters due to the structure of the likelihood or conditional 
priors of other parameters. This dependence can dramatically worsen the convergence rate 
of associated Gibbs samplers. 

8.2 Remedies for Potential Convergence Problems 

When convergence problems arise due to stochasticity of the nuisance parameters in the 
manner discussed in the previous susbection, the most straightforward solution is simply to 
take these nuisance parameters as known. Using empirical Bayes to specify the nuisance pa¬ 
rameters is a viable option. However, if such parameters must be taken as unknown, then an 
intermediate approach is to take strongly informative priors for these parameters. Of course, 
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such specifications still require the practitioner to supply prior knowledge of these parame¬ 
ters’ (approximate) values. Empirical Bayes can once more be very useful in such instances. 
Specifically, hybrid methods that combine empirical Bayes with dimensionally-dependent 
hyperparameters can enjoy the superior convergence complexity of the known-parameter 
approach while retaining the obvious inferential benefits of allowing these parameters to be 
stochastic. Clearly there is often a trade-off between convergence complexity and other goals 
of Bayesian inference. 

More generally, a variety of practical methods have been proposed for improving the 
convergence rate of Markov chains used in Bayesian inference in the classical regime where n 
and p are fixed and p is small. Such methods may involve reorganization of the structure of 
the actual sampling steps by grouping or collapsing (see, e.g., Liu et ah, 1994), reparametriza- 
tion of the model (see, e.g., Gelfand et ah, 1995; Roberts and Sahu, 1997; Papaspiliopoulos 
et ah, 2007; Yu and Meng, 2011), or expansion of the parameter space by methods such as 
PX-DA ( fiu and Wu, 1999). It is well established that these methods can indeed reduce 
the value of the constant associated with the geometric convergence in some settings for any 
particular fixed values of n and p. However, what is less clear is whether such techniques 
can qualitatively alter the behavior of a chain in terms of convergence complexity in various 
n and p regimes. More precisely, it is essential to determine whether there are settings and 
regimes where r n _ p —> 1 for some basic chain, but where the use of one of these convergence 
acceleration methods can instead yield a chain for which r U:P is bounded away from 1. Such 
questions are topics to be investigated in forthcoming work. Another potentially useful tool 
for diagnosing and understanding convergence complexity from a practical point of view is 
the dimensional autocorrelation function, or DACF, plots introduced in Figure 3. These 
plots can potentially provide insight into the convergence complexity of a Markov chain as 
a function of dimension and sample size. Additionally, when slow convergence is encoun¬ 
tered by MCMC practitioners in any given application, a DACF plot could potentially aid in 
determining whether the problems are due to convergence complexity issues or some other 
cause. 

From a theoretical standpoint, we were able to establish convergence rates that are 
bounded away from 1 in important classes of models. Even in the problematic regression 
setting, the ability of MCMC as an efficient tool to sample from the posterior was demon¬ 
strated, including in settings where the dimension is larger than the sample size. Novel 
approaches are nevertheless required when the dimension grows faster than the sample size. 
Section 7 used the Gibbs sampler for an independent-prior regression approach as an exam¬ 
ple of the specific way in which the result of losenthal (1995) can still be used to obtain 
bounded convergence rates in sample-starved high-dimensional settings. However, it seems 
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that there may only be certain cases in which such an approach can provide a bound that is 
sharp enough to permit analysis in various n and p regimes. In this paper, we have tried to 
overcome this problem by using a “first principles” approach by considering various classes 
of important models and analyzing the convergence behavior of the corresponding Gibbs 
samplers. It would be useful to generalize this strategy. We therefore hope that one conse¬ 
quence of our work will be to motivate the proposal and development of new ideas analogous 
to those of Rosenthal that are suitable for high-dimensional settings. 
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Supplemental Sections 

A Preliminaries 


If P and Q are both distributions on M D , then the Wasserstein distance between P and Q is 


D 


d w (P, Q ) = inf E{\\X - V||i) = inf E ( ~ Y i\ ), 

G=i 


where the inhmum is taken over all joint distributions of D -dimensional random vectors X 
and Y with respective marginal distributions P and Q ( Wasserstein, 1969). The Wasserstein 
distance may be equivalently defined as 


d w (P, Q ) = sup 



where the supremum is taken over all functions h such that h(x, y ) < ||ic — 2/|| 1 for all x, y e 
M. D ( Izulga, 1983). Note that this distance is sometimes called the first Wasserstein distance 
since it can also be generalized by replacing the I\ norm (in either definition) with some 
other norm. Gibbs and Su (2002) provide further discussion of the relationships between 
Wasserstein distance, total variation distance, and other distances between distributions. 


Proof of Lemma 2.2. We consider the two asymptotic regimes separately. 

Case I: Fixed p, increasing n. Since det A > A p , it is clear that 5 < 2 n e n / 4 . Thus, 
as n —> oo, 5 —> 0, so f = 1 — 5 —> 1. In fact, f —>• 1 exponentially fast as n —> oo. Now 
suppose that the bound in (2.4) is used to calculate a number of iterations K n e that will 
yield convergence to within a given tolerance e > 0. If n is large, then 5 is small, and hence 


log(e/M) 


Kn " ~ log(l - S) 


-log (M/e) > (2e 1/4 ) n log (M/e) = exp 
o 


n I - + log 2 


log (M/e). 


Thus, the number of iterations required for convergence based on the bound in (2.4) grows 
exponentially in the sample size. 

Case II : Fixed n, increasing p. Suppose p > n, and let X = UClV T , where the n x n 
matrix U and the p x p matrix V are orthogonal with columns Ui,... , u n and Vi,... ,v p 
(respectively), and where Cl is n x p rectangular-diagonal with Cl = Diag nxp (o;i,..., u n ). 
(Note that these matrices depend on p, although we do not indicate this dependence explicitly 
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in the notation.) Then 


a ~ 1/2 = {-vn T nv T + xvv T 


Then an alternative upper bound for 5 is 

1 


- 1/2 


= v[ -wn + xip 


- 1/2 


i <exp (-- 


= exp 


= exp 


= exp 


1 

4A 


XA~ 1/2 X t Y 
1 


-1/2 


unv r v ( -n 1 n + xi p ) v T vn t u t y 


Diag ; 

1 A uf 


2u\ 


2cj 2 


--V 

9 \ 


2A cj/ + 2A 

i=i 1 


• pxp \oj 2 T 2A’' " ’ (^2 _|_ 2A ; 

ujY 


,o,...,o m t u t y 


\ 2 


! ( \ 


2 " 


< exp 

max uj t 

Y 


/ 


2X \!<*<n * J 


2 


Now observe that 69 2 ,..., 69 2 are the eigenvalues of XX T . The largest of these eigen¬ 
values is bounded below by the largest diagonal element of XX T , i.e., maxi<, < n u 2 > 
maxi<,<„ i Xfj , which is of order p. Thus, max!<i< n w? — > oo as p —> oo. Since ||T^||| 

does not depend on p, it follows that S —* 0 as p —> oo, so again 1 — 5 —)• 1. Furthermore, 
1 — 6 —> 1 exponentially fast as p —> oo. □ 

Proof of Lemma 2.f. First, note that the prior 7r(cr 2 ) oc 1/a 2 above corresponds to a = £ = 0 
in the notation of vhare and Hobert (2013). Now define 7 and b as in equations (3.9) 
and (3.10) of Khare and Hobert (2013): 
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max 


P 


n + p — 2 ’ 2 J ’ 


b 


T p(n + 2 p) p_ 

2A 2 A 2 ' 


Next, let d > 26/(1 — 7 ) as required by Proposition 4 of Chare and Hobert (2013), and then 
define e as in equation (3.16) of ihare and Hobert (2013): 


e = e- 1 ' 2 < 


I n - X(X T X Ad- 1 Ip) X X 


d(l + p 2 X 2 d) 


(n+p )/2 


Observe that d > 26/(1 - 7 ) > 26 > 2 Y T Y, and thus e < {Y T Y/d^ n+p ^ 2 < 2~ { - n+ ^/ 2 -9 0 
as either n or p tends to 00 . Now observe that Proposition 4 of Chare and Hobert (2013) 
establishes a bound r for the geometric rate constant that is at least as large as 1 — e. Then 
this bound r tends to 1 exponentially fast as either n or p tends to 00 . ~’|3t 
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B Regression Models 

Proof of Theorem 3.1. Begin by writing the Gibbs sampler in (3.2) as (for every k > 1) 


0k =/3+ v^Li W 1/2 Z t , 
, 1 


ffifc = 


v k 


Pk - P) A(p k - P) +C 


where Z k ~ N p (O p ,I p ), 
where V k ~ xl + P , 


and where all of the Z k and V k are independent. Substituting for p k yields (for every k > 1) 


cr; 


1 = tt {°l-iUk + C ), where U k ~ V k ~ xl +P , 

* k 


(B.l) 


and where the U k and V k are all independent. Note that the marginal posterior of cr 2 , and 
hence the stationary distribution of the marginal chain in (B.l) above, is 


( n (j 

<j 2 I Y ~ InverseGamma ( —, — 

1 I 2 2 


by elementary Bayesian computations (see, e.g., ( ) 'Hagan and Forster, 2010). 

We first establish the lower bound. By the results of .in et al. (1994), it suffices to 
show that r = p/(n + p — 2) is an eigenvalue of F a 2 , the forward operator of the marginal 
chain in (B.l) on the space of mean-zero, finite-variance functions on the positive half-line. 
Let p(<y 2 ) = cr 2 — C/(n — 2), which has mean zero and finite variance under the stationary 
distribution (i.e., the true marginal posterior). Then it is clear from the form of the marginal 
chain in (B.l) that 

E[p(al) | a 2 k _ x ] = n+ P p _ 2 P{°l- 1 ) 

for every k > 1. Thus, p is an eigenfunction of the marginal forward operator F a 2 with 
eigenvalue pj(n + p — 2), so the largest eigenvalue of F a 2 is at least p/(n + p — 2). Thus, 
if the joint and marginal chains are geometrically ergodic, then their convergence rate is at 
least p/(n + p — 2), and the result follows. 

We now establish the upper bound. Make the transformation Q = A L ^ 2 (P — P) and (for 
every k > 1) 6 k = A 1/2 (p k - /3), so that the Gibbs sampler in (3.2) may be written as 




cr 


2 

k 


\J 0fc_i z k , 

imt+c 

v k ’ 


where Z k ~ N p (O p ,I p ), 
where V k ~ y 2 +p , 
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for every k > 1, where all of the Z k and V k are independent. Substituting for cr k _i yields 


Ok = J — ~ Z k , where ~ N P (0 P , I p ), V*-i ~ Xn+ P > (B.2) 

for every k > 2, where all of the Z k and V k are independent. (The fact that G i differs will 
ultimately be irrelevant.) Note that the marginal posterior of G, and hence the stationary 
distribution of the marginal chain in (B.2) above, is 



°p,n i 


where t P)U denotes the p-variate t distribution with n degrees of freedom, by elementary 
Bayesian computations (see, e.g., O'Hagan and Forster, 2010). By the results of Liu et al. 
(199 ), it suffices to work with the marginal chain in (B.2). Although the specification of Gi 
differs from that of G k for k > 2, this issue clearly has no effect on the convergence rate. 
Thus, we henceforth work with a modified version of the marginal G k chain in which some 
initial value Gq G M. p is provided and the specification in (B.2) is extended to k = 1 as well 
(and where we take V 0 ~ Xn+ P to be independent of the other Z k and V k ). For every k > 1, 
let G k (Go) denote the distribution of the kth iterate of this modified version of the marginal 
G k chain with starting value Gq. Let G denote the corresponding stationary distribution (i.e., 
G is a scaled t P)U distribution). 

Now define p = ||0||!/(||0||! + C) and (for every k > 0) rj k = H^lli/dl^fclli + C)- Then 


Vk = 


W k 


W k + 1 — Vk-1 ’ 


where W k = 


J k || 2 


T4_! 


( p 

2 


n + p 


(B.3) 


for every k > 1, where all of the W k are independent and where BetaPrime(-, •) denotes 
the beta prime distribution or beta distribution of the second kind. For every k > 1, let 
H k (ji o) denote the distribution of the Lth iterate of the p k chain in (B.3) with starting 
value Vo G [0,1), noting that this distribution depends on G 0 only through ||0 O || 1 an d hence 
only through rj Q . Let H denote the corresponding stationary distribution, noting that H is 
the Beta(p/2, n/2) distribution by elementary results. Observe that for every k > 1, the 
distribution of is uniform on the unit sphere in M p and is therefore independent 

of Vk- Similarly, the posterior distribution of ||0 ||^ X G \ Y is uniform on the unit sphere in M p 
and hence is independent (a posteriori) of r/ | Y. Then it follows that dTv[G k (G 0 ), G] — 
clTv[H k (vo), H] by the properties of total variation distance. Thus, it suffices to show that 
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the 7/fc chain in (B.3) converges geometrically in total variation distance with rate constant 
no larger than p/(n + p — 2). 

Now suppose that t/ 0 ~ H , i.e., suppose that the chain is stationary. Then the geometric 
convergence rate of the 7/ fc chain in general (i.e., not under stationarity) is equal to 7 ( 7 / 0 , hi), 
the maximal correlation of successive iterates under stationarity (Liu et ah, 1994). The 
application of standard transformational techniques to (B.3) shows that the joint distribution 
of 7/ 0 and ?/i under stationarity has density 

T[(n + 2p)/2\ 7/o P_2)/2 (1 - //o) (n+p " 2)/2 r/i p - 2)/2 (l - Vl )^ n+P ~ 2)/2 

[T(p/2)] 2 T(ra/2) (1 - hohi) (n+2p)/2 

with respect to Lebesgue measure on (0,1) x (0,1). Then the joint distribution of t/ q and ?/i 
is the bivariate beta distribution of kin and Liu (2003) with parameters p/2, p/2, and n/2. 
By equation (1.6) of i kin and Liu (2003), the distribution of r / 0 and 7/1 may be written as 

ho, hi I Q ~ iid Beta^| + Q, , Q ~ BetaNegBin^|, |, 7 ^, 


f {vo ’ m \v o,hi) 


where BetaNegBin(-, -, •) denotes the beta-negative binomial distribution, i.e., the distribu¬ 
tion of Q has density 



r[(rc+p)/2]r[(p + 2g)/2] 
r(p/2) 


y_ 1 _ 

J T(n/2) q\ r[(rc + 2p + 2q)/2] 


with respect to counting measure on No, where No denotes the set of nonnegative integers. 
Then 7 ( 7 / 0 , hi) = IMhtnQ)] 2 ( hu, 2008, Lemma 2.1). We now write t/ q as fj to avoid an 
eventual conflict in notation. The joint distribution of fj and Q has density 


a ) = r [( n + h)/ 2 ] /-\0H-2?-2)/2/-| _ ~\(n+p- 2)/2 I '[(p + 2q)/2] 

1 {V,q> T(n/2) [T(p/2)] 2 {V) 1 ]> q\ 


(B.4) 


with respect to the product of Lebesgue measure on (0,1) and counting measure on No- Since 
7/ ~ Beta(p/2, n/2) marginally, the conditional density of Q \ fj with respect to counting 
measure on No is 

T(p/2) q\ 


i.e., Q | fj rs_7 NegBin(p/2, 77), where NegBin(-, •) denotes the negative binomial distribution. 
Now consider an entirely separate Gibbs sampler that draws from the joint distribution of fj 
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and Q in (B.4) by fixing a starting point Q 0 G N 0 and drawing (for every k > 1) 


Vk | Qk-i ~ Beta I — + Qk-i, 


(% + Qk-u n ), Qk | fjk ~ NegBin(|, fj^j. (B.5) 


Then [7 (fj,Q)] 2 under the joint distribution in (B.4) is equal to the convergence rate of the 
Gibbs sampler in (B.5) ( .iu et ah, 1994). Therefore the original standard regression Gibbs 
sampler has the same convergence rate as the Gibbs sampler in (B.5). Integrating out fjk 
from the joint chain in (B.5) yields 


n + p 


/ p p 

Qt I Qk-i ~ BetaNegBinl - + Q t _i, 


(B- 6 ) 


for every k > 1. The convergence rate of this marginal Qk chain equals that of the joint 
chain and hence equals that of the original standard regression Gibbs sampler. Observe that 


E{Qi | Qo — Qo) — ^ + Qo^j 


n + p \ 

) 


V \ p 2 

n + p — 2j^° 2(n+p — 2) 


by the basic properties of the beta-negative binomial distribution. Then since the marginal 
Qk chain in (B.6) is stochastically monotone, we may apply Theorem 2.1 of iaconis et al. 
(2010) with their monotone function / taken as the identity function, their A G (0,1) taken 
as A = p/(n + p — 2), and their c > 0 taken as c = 1. It follows immediately that an upper 
bound for the convergence rate of the chain in (B.6), and hence an upper bound for the 
convergence rate of the standard regression Gibbs sampler, is p/(n + p — 2). O 

We now establish a sharp bound for the dw-convergence rate of the marginal chain 
in (B.l). Note that once more we will use the behavior of the marginal chain in (B.l) as a 
surrogate for the behavior of the overall chain in (3.2). For every k > 0, let Gk(& 0 ) denote 
the distribution of for the marginal chain in (B.l) started with initial value <Tq, and let G 
denote the stationary distribution of this chain, i.e., the true marginal posterior of cr 2 . Note 
that G is simply the InverseGamma (n / 2 , C/2) distribution. Then we have the following 
result. 


Theorem B.l. For the marginal chain in (B.l) of the standard Bayesian regression Gibbs 
sampler specified in (3.2), 




p 


n + p — 2 


k 

< d w [Gfc(cro), G] 


< M 2 (ctq) 


p 


n + p — 2 


k 
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for every k > 0, where 


Mi (tfo) = 


C 


CT n 


n 


M 2 (a^) = 


c 


On 


n 


+ 


C 


n 


n 


Proof. We first establish the lower bound. Let a 2 ) = er 2 — C/(n — 2). Then it is clear from 

the form of the marginal chain in (B.l) that 


E [H a l) | a l- 1 ] = 


P 


n + p — 2 

for every k > 1, which by repeated application yields 


H a k- 1 ) 


m«i)\ = 


p 




K n + p — 2 / 

for every k > 0. Now note that tjj has Lipschitz constant 1. Then 


d w [G k (o*),G] > E[if(al)]-E[if(a 2 ) \ Y] 


C 


Un 


n — 2 


P 


n + p — 2 


for every fc > 0 since ^[^(a 2 ) | Y] — 0, which establishes the lower bound. 

To establish the upper bound, let £ be a random variable such that £ ~ G. Then 


d w [G 0 (a 2 ),G] =£(|£-a 2 |) < 


C 


- 


< 


n — 2 
C 


+ E 


£- 


c 




n — 2 


n — 2 

+ v/Var(£)=M 2 (a 0 2 ), 


noting that £'(£) = C/(n — 2). Hence the upper bound is established for k — 0. Now assume 
as an inductive hypothesis that the upper bound holds for some arbitrary k > 0. Then there 
exists a random variable £& such that ~ Gk{cr g) and 

S(l&-el)=rf»'[G i (<T2),G] <M 2 p 0 2 ) ( ra + ^_ 2 l , 

noting that the existence of a coupling that attains the Wasserstein distance is well known 
(e.g., Rachev, 1984; Givens and Shortt, 1984). Now let U ~ y 2 and V ~ X 2 +p be independent 
of £, £ fc , and each other, and define random variables £fc + i and £* according to 

fk+i = y(U fk + C), £* = —([/£ +G), 
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noting that £,k+i rs-/ G k+1 (al) and £* ~ G by construction. Then 


d w [G k+1 (a 2 0 ),G] < E( |6+i -e*l) = E {^\Zk 


—--- d w [G k (o-q) , G] 

n + p — 2 L v ' J 



establishing the upper bound for every k > 0 by induction. 


B 


Remark. The method of proof of the upper bound in Theorem B.l relies upon the estab¬ 
lishment of a coupling between the distributions of the iterates of the Markov chain and the 
stationary distribution. (Note that the word coupling here refers to a joint distribution that 
yields some specified distributions as its marginals, which differs from its usual meaning 
in the context of Markov chain analysis.) To our knowledge, this approach has not been 
previously used to construct quantitative results for Markov chain convergence. 

Remark. Note that the expression for the geometric rate constant r is sharp since the upper 
and lower bounds both lead to r = p/(n + p — 2). 

Theorem B.l applies for any particular fixed values of n and p. This sharp result allows 
us to analyze the standard Bayesian regression Gibbs sampler in (3.2) as the values of n 
and p = p n grow. Thus, we can understand the convergence of the chain in various n and p 
regimes. To do so, suppose we have a sequence of n x p n covariate matrices X n and a 
sequence of n x 1 response vectors Y n . For the sake of complete rigor, we also impose the 
following very mild assumption for the remainder of the results in this subsection. 

Assumption B.2. ||||| = 0{n). 

Now also note that we will write A n , C n , Mi^Uq), and M 2 , n (cT q) to denote the dependence 
of these quantities on n. Finally, for every n > 5 and k > 0, let G n ^ k (crl) denote the 
distribution of a\ for the marginal chain in (B.l) started with initial value (Tq, and let 
G n denote the stationary distribution of this chain. The following result now allows us to 
understand the convergence behavior of the Markov chain corresponding to the Bayesian 
analysis of the classical regression model in (3.1) in various n and p regimes. 

Corollary B.3. For the marginal chain in (B.l) of the standard Bayesian regression Gibbs 
sampler specified in (3.2), 


m i( cr o) r n < d w [G n , k (al),G n ] <m 2 (ol) r k n 
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for all k and n, where 


r n = —-— - mi(cro) = inf M l n (cr%) > 0, m 2 (c r o) = supM 2 , n (cro) < cx). 

n + p n - 2 n> 5 n>5 

Moreover, r n is bounded away from 1 if and only if p n = 0(n). 

Proof. Note that 0 < C n — Yff (I n — X n A~ l Xf()Y n < ||Y^||| = 0(n) by Assumption B.2 
and since the matrix X n A~ l X'ff is positive semidehnite. Then clearly m 2 (<7g) < oo, and the 
rest follows immediately from Theorem B.l. □ 

The message of Corollary B.3 is that although the convergence to the stationary distribu¬ 
tion is geometric, the rate constant of the geometric convergence tends to 1 if the number of 
parameters (i.e., the number of regression coefficients, or equivalently, the number of predic¬ 
tor variables) grows faster than the sample size. More practically, the results of Theorem B.l 
and Corollary B.3 may be understood by considering the number of iterations required for ap¬ 
proximate convergence. Let e > 0 be given, and let AA.e^o) denote the number of iterations 
required for the Markov chain to be within e of the stationary distribution in Wasserstein 
distance, i.e., 


K n ,e{orjj) = inf {AT > 0 : d w [G n ,fc(^o) > G A < £ for every k> K}. 

The following result asserts that if the dimension of the parameter grows faster than the 
sample size, then the number of iterations required (to obtain approximate convergence to 
within some desired distance e > 0) grows without bound. 

Corollary B.4. For the marginal chain in (B.l) of the standard Bayesian regression Gibbs 
sampler specified in (3.2), suppose that 0 < e < mi (erg). Then K nt£ (a^) = 0(1) as n oo 
if and only if p n = 0(n). 

Proof. Assume that p n = 0(n ). Then there exists r < 1 such that r n < r for all n. Now let 
K = log[e/m 2 (o'o)]/l°g( r )> an d note that for every k > K and all n, 

dw[G n ,k(vl),G n \ < m 2 {crl)r k n < m 2 ( y a( j )r k < m 2 (al)r K = e. 

Thus, K n Ja() < K for all n, so K n Jal) = 0(1). 

Now assume instead that K n ^{a'l) = 0(1). Then there exists an integer K > 0 such that 

m i(°o) r n < d w [G n , K {al) ’Gn] <£ for all n, 
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which implies that r n < [e/m^dg)] 1 ^ < 1 for all n. It follows that 

Pn < Pn _ r n [e/mi^al)} 1 ^ 

n ~ n — 2 1 — r n ~ 1 — [e/m^crg)] 1 /-^ 

for all n, and thus p n = 0(n). O 

To express the idea of Corollary B.4 somewhat more finely, we can consider the rate at 
which /^^(crg) grows with p n and n. Note from Corollary B.3 and the definition of -K’n.e^o) 
that 


logmi(ag) - log £ 
log {n + p n - 2) - log p n 


< K n , e ( o-q) < 


log 777-2 (o~q) - logs 
log (n + p n - 2) - log p n 


where [■] denotes the ceiling function. Thus, K n ^(ol) is proportional to 


[log(?r + p n — 2) — logp n ] 1 


log 1 + 


n — 2 

Pn 



Pn 

n 


for large n and p n with n < p n . Thus, the rate of growth of K n £ (aY) is asymptotically linear 
in the ratio p n /n. In particular, an increase in the parameter dimension p increases the 
number of iteratons required for approximate convergence, while an increase in the sample 
size n reduces it. As a concrete example, a hundredfold increase in the parameter dimension 
implies a hundredfold increase in the required number of iterations (holding the sample size 
constant). Hence, an especially large number of iterations may be required in the modern 
“large p, small n” setting. The above result thus questions the validity of high-dimensional 
Bayesian inference that relies on regression-type Gibbs samplers. 


Proof of Lemma 3.2. The result is obtained by a straightforward calculation of the posterior 
correlation using the conditional and marginal posteriors as well as standard properties of 
the y 2 , F, and inverse-gamma distributions. □ 

Proof of Lemma 3.3. Note that <r 2 ~ G and a| +1 ~ G since erg ~ G. Then Var(a^) = 
Var(o^ +1 ), and 


Co y{crlal +1 ) = Cov 


2 

a ki 


Vi i 


k -\-1 


-(cfcGfc+i + C) 


= Ktil) Var(<7l) = Vw G)- 
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Similarly, Var(||0 A .|||) = Vax(||0 fc+1 |||), and 
Cov(||0 fc ||jj, H^fc+illa) = Cov 


n ||2 U]~+\ 

Ok || 2 , -T7~ 

Vb 


0 fc 11 2 + C) 


= E 


Uk+i 

V 


n + p — 2 


Va r (||6> fc |||) 
Var (||0 fc |||). 


The result then follows immediately. 


□ 


Proof of Lemma 3.f. For all k > 1, we have 0fc/||0fc||2 = Zk/\\Zk || 2 , and the Zj. are inde¬ 
pendent. □ 

Proof of Theorem 3.5. For each j e {1 Theorem 3.1 immediately establishes the 

existence of constants 0 < M J: \ < Mj j2 such that 




n + Sj — 2 


< d T v, Sj] < M j>2 


n + 5j — 2 


for every k > 0. Now observe that dTv[^k(Do), S] < Y^j=\ dTv[^j,k(Djj-o), Sj] due to the 
mutual independence of the (. Djj , L pa ^ j). Then 


dTv[^k(D 0 ), 5] < y; M h 2 
3 = 1 


n + Sj — 2 


< m ( max M, 2 


n + <5 n 


establishing the upper bound with M 2 = mmax^-^M^. Now note that there exists 
J G {1,... , m} such that <5/ = <5 max . Then 


^Tv[Sfc(-D 0 ), “] > V, Mj, 1 

i=i 


n + 5j — 2 


> Mja 


n T $max 2 / 


etablishing the lower bound with M) = A4j, 1 • 


□ 


C Bayesian Model Selection 

Proof of Theorem f.l. Begin by noting that in the Gibbs sampler in (4.2), may be ex¬ 
pressed as 


= 


0fc i Uk + C. 


Tfc 


14 


where U k ~ xL 


(C.l) 
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and where U k is independent of cr^_,. Then 


n ( 2 2 \ n ( 2 a kUk +1 + Cxfc+i \ 

Cov(£7 fc , <T fc+ i) = Cov I cr fc , -—- 1 


= Cov 


/" 2 Vfc+l 2^ , ( 2 C- 

°ki T7-+ Cov 0 T fc , 


V K 14+1 


1 


V l4+i J 


Recall that f/fc+i and V k +i are independent of each other and of cr^, so 


Cov 


/ 2 ^4+1 2^ 

V*’ +7+) 


= E 

= E 


Uk+i 


tfY 


e{4)e(^ 


( Uk+i 

V^i 


Since Vk+i is also independent of r/, ;+ i, we have 


Cov 


( 4 , 


C. 


T k +1 


V K 14+1 


= E 


a 


T k+1 2 
(J b. 


14+i 

1 


E 


a 


Tk +1 


14+1 J 


) E(al) 


n + p — 2 
1 


^(^ +1 ^)-+E(c Tfc+1 )h;(^)] 


Cov (cr^, C Tfe+1 ) > 


(C.2) 


(c.3) 


n + p-2 v fc+1 ' “ n + p-2 
by the Cauchy-Schwarz inequality. Now observe that 


\/var+l) Var(C Tfc+1 ) (C.4) 


Var(C rj . +i ) < £ 0 


■y2 

'T’fc + l 


= E 


Y r (l n - XA-; + X t )Y 


1 2 


< (^) 2 , 


noting once again that Y is nonrandom from the point of view of the Gibbs sampling Markov 
chain. Combining this result with the inequality in (C.4) yields 


Cov 


(+ 


> 


Y T Y 


\ 14+1 J n+p-2 


Var K), 


which in turn may be combined with the results in (C.2) and (C.3) to obtain 


Cov(cr fc , a k+l ) > 


P 


Var (al) 


n + p — 2 

The desired result then follows from the fact that 


Y T Y 


p v / Var(cj2) 


Var(a^) = Var(c^ +1 ) = Var (a 2 | Y) 
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since a\ ~ n (a 2 | Y) and cr 2 ~ n (cr 2 | V). 


0 


D Multivariate Location Models 


Proof of Theorem 5.1. Begin by writing the Gibbs sampler in (5.2) as 


Pk = P + 


cr; 


fc-i 


n + A 


cr,. = 


(n + A)||/Xfc - MII 2 + C 
V k 


where Z k ~ N p ( 0 p , J p ), 


where 14 ~ xlp+p, 


and where all of the Z k and 14 are independent. Substituting for p, k yields 


cr 


k ~ y-^k-lUk + C), 


where U k ~ xl, V k ~ X 


X 2 


(D.l) 


and where the U k and 14 are all independent. This marginal chain is the same as the marginal 
chain in (B.l) of the standard Bayesian regression Gibbs sampler, except with the degrees 
of freedom of 14 changed from n + p to np + p. Thus, the proof is essentially identical to 
that of Theorem 3.1. □ 

We now establish a sharp bound for the djy-convergence rate of the marginal cr| chain 
in (D.l) of the Gibbs sampler for the multivariate mean model. For every k > 0, let G k (a q) 
denote the distribution of a\ for the marginal chain in (D.l). Let G denote the stationary 
distribution of this chain, i.e., the true marginal posterior of cr 2 . Then we have the following 
result. 

Theorem D.l. For the marginal chain in (D.l) of the Gibbs sampler for the multivariate 
mean model, 


Mi (a 2 ) 


P 


np + p — 2 


< d w [C4(ctq), G] < M 2 (ag) 


p 


np + p — 2 


for every k > 0, where 


Mi (a 2 ) = 


C 




np — 2 


M 2 (a 2 ) = 


C 


0) - 


np — 2 


+ 


C 


np — 2 V np — 4 


Proof. The proof is essentially identical to that of Theorem B.l. 


□ 
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E Normal Hierarchical Model 


Proof of Theorem 6.1. Begin by writing the Gibbs sampler in (6.2) as (for every k > 1) 


■0 k,i = 


Pk = 


T 


a 2 + t 
1 
n 


2 X * 


G 


a 2 + r : 


;Pk—i 


g 2 t 2 
a 2 + t 2 


■ k.i-) 




i= 1 


— Zk, 

n 


where Y k>i ~ N P (0 P , I p ), 
where Z k ~ N p (O p ,I p ), 


where i G {1,..., n}, and where all of the Y k) i and Z k are independent. Substituting for 
yields (for every k > 1) 


Pk = 



Pk -1 + 


Pk-1 + 


I G 2 T 2 

a 2 + t 2 


n 




i =1 


+ \l- z k 

n 


(r 2 ) 2 + 2cr 2 r 2 
n(a 2 + r 2 ) 


W fc , 


(E.l) 


where W k rv_/ iV p (O p ,/ p ) and where the W k are all independent. Note that the stationary dis¬ 
tribution of this chain (i.e., the true marginal posterior of fi) is the p-dimensional multivariate 
normal distribution with mean vector n -1 E" =1 and covariance matrix n~ l (a 2 + r 2 )J p . 
Now define fi = /i — n~ 1 E"=i and Pk = Pk ~ n _1 E™=i If I s clear that the total 
variation distance between the distribution of fi k and the marginal posterior of fi is the same 
as that between the distribution of and the marginal posterior of [i. Thus, it suffices 
to prove the result in the case where n _1 = 0 p , which we will henceforth assume. 

Then for every k > 1, 


Pk = 


g 


a 2 + r : 


;Pk -1 + 


(r 2 ) 2 + 2 g 2 t 2 
n(o 2 + t 2 ) 


W k , 


which implies that 


Pk = 


a 


a 2 + t z 


Po + 


2^2 


a 


= r po + 


a 2 + r 


n 


t 2 ) 2 + 2 g 2 t 
n(o 2 + t 2 ) z ^Vcr 2 + r 2 
1/2 


E 


Z—1 


n 1/2 




(l — r 2fe ) 


W fc , 
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where W k ~ N p (O p , I p ) and r = cr 2 /(cr 2 + r 2 ). Hence, 




N„ 


rVo, 


cr 2 + r 2 


n 


(i 



(E.2) 


for every k > 0. Now note that d T y[H k (^i 0 ), H] is at least as large as the d TV -distance 
between the N p [r k 11 0 , n^ 1 (cr 2 + t 2 )/ p ] and N p [O p , n~ 1 (a 2 + t 2 )I p \ distributions. Then by ele¬ 
mentary properties of the total variation distance between multivariate normal distributions, 
we have 


d,Tv[H k (no), H] > J 


n 


2 (a 2 + r 2 ) 


Mo 2 f 


for all sufficiently large k. To establish the upper bound, let H k (no) denote the p- variate 
normal distribution with mean r k fi o and covariance matrix n _1 (cr 2 + t 2 )I p . Then 


d T \r[H k (/j, 0 ), H] < d TV H k (/j, 0 ),H 

1 1/2 


T d< 


< 


n 


2(a 2 + r 2 ) 


P 


TV 


IMoIU r 


HkifJ* o), 0 ) 


7r(l — r 2k ) 


log 


1 — r 


2k 


(1 - r 2k ) 


2k\ 1 / r2 


1 1/2 


1 - 


-2k 


< 


n 


iMoll 9 r K + pr 2k < 


2 (a 2 + r 2 ) 


n 


cr 2 + t- 


IMo || 2 r 


for all sufficiently large k. 


□ 


Remark. The chain Hk is linear in the previous iterate. Thus, the sharp bound above 
can also be obtained by evaluating the maximal correlation. (Recall that Theorem 3.1 for 
obtaining sharp rates for the standard regression model was obtained using the maximal 
correlation method.) 


We now establish a sharp bound for the dyr -conver g ence rate of the marginal pL k chain 
in (6.2) of the Gibbs sampler for the normal hierarchical model. 

Theorem E.l. Consider the Gibbs sampler for the normal hierarchical model in (6.2). Then 

M 1 ( t i 0 )r k < l - d w [H k (n 0 ),H} < M 2 (hq) r k 
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for every k > 0, where r = a 2 /(cr 2 + r 2 ) and 


Mi ( Mo ) — ~ 
P 


Vo 


n 


V -Y, 


2=1 


M 2 (m o) = - 
V 


Vo 


n 


V-v 


2=1 


2(a 2 + r 2 ) 


U7T 


Proof. Begin by noting that Hk(vo) was derived in (E.2) in the proof of Theorem 6.1. The 
lower bound then follows immediately from a comparison of the means of the distributions 
H k (v o) and H. 

To establish the upper bound, let £ be a random variable such that £ ~ H. Then 


d w [H 0 {vo),H] = £(||£ - Molli) < IlMolli T-Edl^lli) = ||Mo||i +P \j= P M i(.V o), 

establishing the upper bound for k = 0. Now assume as an inductive hypothesis that the 
upper bound holds for some arbitrary k > 0. Then there exists a random variable £ k such 
that £ k ~ H k (no) and 


E(Uk ~ £U = d w [H k (v o), H) <pM 2 (v o) r\ 


noting that the existence of a coupling that attains the Wasserstein distance is well known 
(e.g., Rachev, 1984; Givens and Shortt, 1984). Now let W ~ N p (O p ,I p ) be independent of 
£ and and define random variables £ k+ i and according to 


£fc+i = r £ k 


(r 2 ) 2 + 2cr 2 r 2 
n(a 2 + r 2 ) 


W, 


Z* = r£ 


( t 2 ) 2 + 2<7 2 T 2 
n(a 2 + t 2 ) 


W, 


noting that £ k +i ~ H k+ i(Vo) and ~ H by construction. Then 


d w [H k+1 (vo),H]<E(\\£ k+1 -£4 1 )=rE(U k -(i\\ 1 )<pM 2 (vo)r k+ \ 


establishing the upper bound for every k > 0 by induction. □ 

Remark. Note that Theorem E.l is stated with the Wasserstein distance and the i\ norms 
multiplied by a factor of 1/p. This factor is introduced to adjust for the fact that the l\ norm 
(on which the Wasserstein distance is based) in p dimensions is a sum of p terms. 
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F Bounded Geometric Convergence Rates for High- 
Dimensional Regression 

Proof of Theorem 7.1. We shall use Theorem 12 of Rosenthal (L99 ), which requires the 
establishment of a drift condition and an associated minorization condition. We use a sub¬ 
script R to correspond to the notation of Rosenthal (1995) when necessary to avoid a conflict 
with the notation in the remainder of the present work. Also note that some expressions 
below require the existence of a cTq > 0 in addition to f3 0 G but this cTq may be taken 
arbitrarily since doing so does not affect the chain in any way. 

Let Vr(ct 2 ,/3) = ||Y — X/3|||, and observe that 

Y-Xfa | crl ~N k (y~ XP al , ajXAffX^. 

(It is to be understood throughout the proof that all distributions and expectations are 
conditional on Y.) Now let X = UflV T , where U and V are orthogonal with columns 
ui,... ,u n and Vi,... ,v p (respectively), and where fl is n x p rectangular-diagonal with 
f2 = Diag nxp (o;i,..., oj n ). (Note that these matrices depend on p, although we do not 
indicate this dependence explicitly in the notation.) Then 


Y-X/3 1 


2 

< 7 1 ~ 


N n 


U I n -V, 


U t Y, afU^^U 1 


where 


= Dia S 


uj T 


c ol 


+ A of + Act 2 


Then for all aj > 0, 


E[V R (al^) | a?] 


and hence 


u[l n -^^U T Y 

(i n -^y Y u; 


2 + tr (a 2 1 UV a *U T> ) 

+ of tr^^ < Y J Y + ncr 2 , 


E[V R {al^)\ =E{E[V R {alf3 1 ) 
< Y T Y + nE(a{) 


k 2 ]} 

= yIV + ^ 2 (ll y - XAII » + > )- 
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Thus, the drift condition of losenthal (199 ) holds with Vr(< 7 2 ,/3) as given above, with 
constants Xr = n/(n + a — 2) and bn = Y 1 Y + ns/{n + a — 2). 

We now establish an associated minorization condition. Let cIr > 26r/(1 — A r), and 
suppose Vr(ct q,/3 0 ) = ||Y — X/3 0 ||2 < d r . Let f(a 2 ,(3 \ (Xq,/3 0 ) denote the density with 
respect to Lebesgue measure of the joint distribution of the (k + l)st iterate given that the 
A; tli iterate takes the value (<7q,/3o)- This density may be expressed as 

f(a 2 ,{3 | u 2 ,/3 0 ) = fp\AP | (x 2 ) fa |/3 0 (cx 2 I AO- 


Now let Qr be the I river seG amnia [ ( n +a) / 2, (dR+s)/ 2] distribution, and let qn be its density 
with respect to Lebesgue measure. Then 


/<r 2 |/3 0 ( cr2 I Po) — 


> 


\Y-Xp 0 \\ 2 + s)/2} 

T[(n + a)/2] 

\\Y - X(3 0 \\ 2 + s 
dR + s 


(n+a)/2 


+) 


— (n+a+2)/2 


exp 


\Y-Xp 0 \\l + s 


2 (T 2 


(n+a )/2 


Qr(v 2 ) > 


dR + s 


(n+a )/2 


qR(cr 2 ). 


Thus, the minorization condition of Rosenthal (1995) holds with Qr and d+{ as given above, 
with £r = [s/ipR + s)]i n+a )/ 2 . The result then follows immediately from Theorem 12 of 
Rosenthal (1995), noting also that the quantity of losenthal (1995) may simply be 

omitted while still preserving an upper bound. □ 

Proof of Corollary 7.2. For any 0 < a < 1, let rpa) = (1 — £r)°, and let r 2 (a) equal the 
quantity in square brackets on the right-hand side of Theorem 7.1. Note that (1 + 2b r + 
\Rd R )/(l + d R ) < 1 since d R > 2b R /(l — X R ), so it follows that there exists 0 < a* < 1 such 
that r 2 (a*) < 1. Then the result clearly holds with r = max{ri(a*), r 2 (a*)}, noting that 
this quantity does not depend on p, X p , or /3 0tP . O 

Proof of Theorem 7.3. Begin by writing the Gibbs sampler in (7.5) as (for every k > 1) 


P k = P+ Jo 2 A~^ 2 Z k , 


a,. = 


V k 


{p k -~p) T A{p k -p)+C + s 


where Z k ~ N p (0 p , I p ), 
where V k ~ xl +v+ \ P e\, 


and where all of the Z k and V k are independent. Substituting for (3 k yields 


a l ~ tt K-iUk + C + s ), 

Vk 


where U k ~ y 2 , V k ~ xl+ P +\ P eV ( R1 ) 
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and where the L4 and V/ arc all independent. This marginal chain is the same as the marginal 
chain in (B.l) of the standard Bayesian regression Gibbs sampler, except with the degrees 
of freedom of 14 changed from n + p to n + p+ \pe \. Thus, the proof is essentially identical 
to that of Theorem 3.1. □ 


G Summary of Convergence Results 


Family 

Model 1 

Gibbs Steps 

Dimension 

Convergence Rate 


Standard 

2 

p, 1 

= p/(n + p - 2) 

Regression 

Independent-Prior 

2 

p, 1 

does not depend on p 

Dimensionally-Dependent 

2 

p, 1 

= p/(n + p + \pe\ - 2) 


Lasso-Type 

3 

p, 1, p 

~ p/(n + p — 2) 

Location 

Location 

2 

np, 1 

~ p/ (np + p — 2) 


Known Variances 

2 

np, p 

= a 2 /{a 2 + t 2 ) 1 

Hierarchical 

Unknown Variances 

4 

np, p, 1, 1 

— * 1 as n, p — » 00 


Unknown Variances, 
Dimensionally-Dependent 

4 

np, p, 1, 1 

1 as n, p — > 00 


H Details of Numerical Results 

Each point in the plots of Figures 2 and 3 represents the average lag-one autocorrelation 
over 10 Gibbs sampling runs of 10,000 iterations each. The quantities, vectors, and matrices 
used in each model are described separately below. 

For the regression-type models in Figure 2, chains were executed for each combination 
of values of n G {10,30,100} and p G {10,30,100}. For each of the 10 runs at each n 
and p setting, the np elements of the n x p covariate matrix X were drawn as independent 
N( 0,1) random variables. Also, for each run, the n x 1 response vector Y was generated as 
Y = Xfa + e, where /3* is a p x 1 vector with its first p/2 elements drawn independently as 
±1 with probability 1/2 each and its remaining p/2 elements set to zero, and where e is an 
n x 1 vector of independent £4 random variables multiplied by 1/2. The initial values were 
set as /3() = l p and ctq = 1. For the Bayesian lasso, the regularization parameter A was set to 
A = 1. For the elastic net, both regularization parameters Ai and A 2 were set to Ai = A 2 = 1. 
The spike-and-slab prior used Q — 1/n and Kj = 10 for all j G {1,... ,p}. 
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For the hierarchical models in the left side of Figure 3, chains were executed for each 
combination of values of n G {10,30,60,100,150,210} and p G {3,10,30,100,300}. For 
the hierarchical models in the center and right side of Figure 3, chains were executed for 
each combination of values of n G {5,15,25,35,45} and p G {5,15,25,35,45}. For each 
of the 10 runs at each n and p setting, the np elements of the matrix X were drawn as 
independent 1 4 random variables multiplied by 1/2. The initial values were set as no — Ip 
and cTq = Tq = 1 . 


20 


